Abstract of "Improved Scientific Analysis through Domain Driven Visualization and Support for Analytic Deliberation" by Radu Jianu, Ph.D., Brown University, May 2012.

This dissertation introduces and evaluates novel visualization methods that enable researchers to derive and test hypotheses from available scientific data faster and more accurately than before. Following the traditional visualization approach, we introduce novel ways of visualizing and interacting with scientific data that support and accelerate researchers' data analysis workflows. Following the visual analytics path, which advocates for supporting the reasoning process itself, we quantify the degree to which interface design elements can be used to unobtrusively guide researchers towards applying verified and established analysis techniques in their research.

We first present novel visualization methods that were developed in response to analytic needs identified through collaborative efforts in three concrete application areas. In neuroscience we enable faster interaction with diffusion tensor imaging (DTI) datasets by creating planar representations of the inherently 3D data. In proteomics we facilitate the visual collation of experimental data and existing protein interaction information and accelerate the discovery process by uncovering and supporting elements of the proteomic analysis workflow. In genomics we increase the accessibility of analyzable visualizations of microarray data and eliminate the overhead of creating visualizations and learning new systems by implementing and evaluating a novel data distribution method.

Finally, we use the concepts of persuasive technology and "choice architecture", which state that a user of a system can be unobtrusively guided towards behavioral patterns that are more efficient, in terms of self-assumed goals, by slight alterations in the system interface. We provide quantitative experimental support for the hypothesis that we can use subtle changes in the interfaces of visual analysis systems to influence users' analytic behavior and thus unobtrusively guide them towards improved analytic strategies. We posit that this approach may facilitate the use of visual analytics expertise to correct biases and heuristics documented in the cognitive science community.

Improved Scientific Analysis through Domain Driven Visualization and Support for Analytic Deliberation

by Radu Jianu
B.S., Polytechnic University of Timisoara, 2005
Sc.M., Brown University, 2007

A dissertation submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in the Department of Computer Science at Brown University

Providence, Rhode Island
May 2012

© Copyright 2012 by Radu Jianu

This dissertation by Radu Jianu is accepted in its present form by the Department of Computer Science as satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date    David H. Laidlaw, Director

Recommended to the Graduate Council

Date    Odest Chadwicke Jenkins, Reader
Date    Ben J. Raphael, Reader

Approved by the Graduate Council

Date    Dean of the Graduate School

Acknowledgements

I would like to thank the following people who have more or less indirectly contributed to the writing of this dissertation. My advisor, David H. Laidlaw, for teaching me to think like a scientist, to find and solve problems worth solving, and for providing a stable and nurturing environment in which I could develop both as a researcher and a person.
David will now continue to guide me in my academic career by serving as a role model for good teaching, research, and advising. I cannot imagine a better person to have worked with closely for so many years. My undergraduate advisor, Adrian Rusu, for instilling in me the desire to pursue a graduate degree and an academic career, and for his initial guidance that made my years here at Brown possible. The readers of this dissertation, Ben Raphael and Chad Jenkins, for their great feedback and helpful comments. Cagatay Demiralp for being a great collaborator, co-author, friend and for not being too upset about the way I spelled his name. My collaborators outside the department, in particular Arthur Salomon and Christophe Benoist, for helping me find and solve the interdisciplinary problems featured in this dissertation. My very close Providence friends for helping me recharge my batteries between coursework, paper submissions, and project deadlines. Thank you Misha, Wenjin, Aparna, Aggeliki, Babis and Olya for many good times. Without your company, my time at Brown would have been unbearable (even if perhaps shorter). At the same time I'd like to thank the Providence Tango community for helping me forget about work during unforgettable dance nights. Lastly, but perhaps most important of all, I would like to thank my family: my wife, Doria, and my parents. Doria, for joining me in this adventure, for putting up with late work nights, and for being a true life companion. My parents, for being there whenever I needed them and for continuously supporting my decisions.

Contents

List of Tables
List of Figures

1 Introduction
   1.1 Problem Statement
   1.2 Overview and Contributions
      1.2.1 Visualization Contributions in Neuroscience
      1.2.2 Visualization Contributions in Proteomics
      1.2.3 Visualization Contributions in Genomics
      1.2.4 Improving Scientists' Analytic Strategies through User Interface Changes
   1.3 Background and Motivation
   1.4 Road Map

2 Planar Exploration and Analysis of 3D White Matter Tractograms
   2.1 Related Work
      2.1.1 Visualizing and Interacting with DTI Datasets
      2.1.2 Visualizing Similarity
      2.1.3 Coordinated Views for Visualization
   2.2 Design Elements
      2.2.1 A 3D Stream-Tube Visualization of DTI Data
      2.2.2 Similarity Between Fiber Tracts
      2.2.3 Explicit Visualizations of Tract Similarity
      2.2.4 Hierarchically Projected Neural Paths
      2.2.5 A Multiple-Views System for Exploring DTI Datasets
   2.3 Evaluation and Findings
      2.3.1 Anecdotal Evaluation
      2.3.2 Quantitative User Study
   2.4 Discussion
   2.5 Concluding Remarks

3 Exploring and Analyzing Protein Networks
   3.1 Related Work
      3.1.1 Visualizing Signaling Pathways and Protein Interaction Networks
      3.1.2 Visualizing and Exploring Networks
      3.1.3 Focus and Context
   3.2 Design Elements
      3.2.1 Pathway Model Specification
      3.2.2 Interaction Data
      3.2.3 Experimental Data
      3.2.4 Network Generation
      3.2.5 Computing Protein Positions
      3.2.6 Augmenting a Pathway Image with Dynamic Data
      3.2.7 Exploring the Network
      3.2.8 Visualization Prototype
      3.2.9 Implementation Details
   3.3 Evaluation and Findings
   3.4 Discussion
      3.4.1 General Considerations
      3.4.2 Layout
      3.4.3 Focus and Context Exploration
   3.5 Concluding Remarks

4 A Map Inspired Framework for Accessible Data Visualization and Analysis
   4.1 Related Work
      4.1.1 Web Based Visualization
      4.1.2 Google Maps
      4.1.3 Biomedical Visualization
      4.1.4 Multidimensional Scaling
      4.1.5 Genome Browsers
      4.1.6 Graphs and Protein Interaction Networks
   4.2 Design Elements
      4.2.1 Example 1: Gene Co-Expression Map
      4.2.2 Example 2: Gene Expression Heatmaps
      4.2.3 Example 3: Genome Map
      4.2.4 Example 4: Protein Interaction Networks
      4.2.5 Example 5: Planar DTI Tractography Maps
      4.2.6 General Design Elements
      4.2.7 Implementing Interaction
      4.2.8 Improving Performance
   4.3 Evaluation and Findings
      4.3.1 Evaluation Summary
      4.3.2 Gene Co-Expression Map
      4.3.3 Gene Expression Heatmap
      4.3.4 Genome Map
      4.3.5 Protein Interaction Networks
      4.3.6 DTI Brain Maps
   4.4 Discussion
      4.4.1 General Considerations
      4.4.2 Opportunities
   4.5 Concluding Remarks

5 Improving Scientists' Analytic Strategies through User Interface Changes
   5.1 Related Work
      5.1.1 Limitations of Human Analysis
      5.1.2 Guiding User's Choices: Nudges and Persuasive Technology
      5.1.3 Supporting Analysis through Visual Analytics
   5.2 User Study Design
      5.2.1 Study Overview
      5.2.2 Task Description
      5.2.3 Analysis Interface and Evaluated Nudges
      5.2.4 User Pool
      5.2.5 User Study Limitations
   5.3 Results
      5.3.1 Data Preparation and Analysis
      5.3.2 Quantitative Support for Nudging Hypothesis
      5.3.3 Qualitative Analysis of Subjects' Workflows
   5.4 Discussion
      5.4.1 Significance
      5.4.2 Applicability
      5.4.3 Design Guidelines
      5.4.4 General Considerations
   5.5 Concluding Remarks

6 Discussion and Conclusion
   6.1 Contributions
   6.2 Impact and Generality of this Dissertation
   6.3 Discussion Items
   6.4 Open Research Opportunities
      6.4.1 Data Infrastructure for Distributing, Analyzing and Cross-Referencing Neurological Data
      6.4.2 Creating Tools for Analyzing Neurological Networks
      6.4.3 Automatically Suggesting Viable Hypotheses in Protein Pathway Analysis
      6.4.4 Cognitive and Domain Driven Analysis Tools and Visualizations
   6.5 Summary

Bibliography

List of Tables

2.1 User performances on bundle selection task.

4.1 Number of tiles and disk space (MB) for the five visualizations with different image compression (PNG vs. JPG) and all tiles vs. non-empty tiles. The first five rows stand for visualizations with 7 zoom levels; the last row corresponds to a 9-level genome browser.

List of Figures

2.1 Coordinated DTI tractogram model exploration in lower dimensional visualizations: 2D embedding (upper-right), hierarchical clustering (lower-left), and L*a*b* color embedder (lower-right). A selection of a fiber-bundle (red) in the hierarchical clustering is mirrored in the other views.

2.2 An interactive analysis system using linked views and planar tract-bundle projections. Three planar representations, along the coronal, transverse and sagittal planes (bottom panels), are linked to a 3D stream-tube model (upper left) and a 2D point embedding of tract similarities (upper right). Selections in the projection views can be performed by clicking or cutting across cluster curves and are mirrored in the 3D view. Points corresponding to the selected tracts are interactively embedded into the plane and used to refine selections at tract level.

2.3 2D tract embedding for different spring force settings. a) Spring force with absolute distance displacement. b) Spring force with absolute distance displacement, weighted by decay function and with repulsive force. c) Spring force with relative distance displacement, weighted by decay function and with repulsive force. In c) clusters are tighter, making selection and understanding of manifold recognition easier.

2.4 Schematic tract-cluster representation. (Top) 2D projections of a tract-bundle, with an associated centroid curve (orange), are determined from a hierarchical clustering of initial 3D tracts. (Middle) The centroid curve is smoothed by a spline and the endpoints of non-centroid curves are clustered using their initial 3D coordinates (four clusters); for each cluster, three control points linking the center of the cluster to the centroid spline are computed. (Bottom) Splines are run from each curve endpoint through the control points of its corresponding cluster.

2.5 Depth ordering of 2D paths. For each segment of a 2D spline, we locate a corresponding segment on the 3D curve from which the spline was derived by traveling the same fractional distance along both curves. The depth of the 2D segment is the same as the depth of the middle of its corresponding 3D segment.
2.7 Comparing 2D embeddings for multiple tract distance measures. On the left, three types of distance measures were embedded: no end-point weight (top), weighted end-points (middle), Hausdorff (bottom). A few tract-points were selected. On the right, the corresponding 3D model is shown (top), together with the selected tracts in isolation from unselected ones (bottom).

3.1 Analysis of a protein interaction network in the context of the T-cell pathway. Proteins and interactions dynamically extracted from the HPRD database (small fonts) are scattered between the protein icons in the pathway view.

3.2 Structuring protein interactions around familiar canonical pathways provides intuitive visualizations. A canonical signaling pathway representation (top) can be imported into the system in two ways: on the lower left, the pathway image itself is loaded into the system and preprocessed by circling proteins and drawing over interactions; the pathway features are then inferred from the user strokes and image features and shown here in black; or, on the lower right, protein and interaction icons are placed and dragged on an empty canvas to create a new pathway model. After positional assignment of each protein, the software aids in associating interaction database accession numbers to each of the newly defined canonical pathway proteins.

3.3 Proteins and interactions from HPRD (small fonts) that are connected to the canonical pathway model are: (left) integrated directly into the signaling pathway image, with one protein selected and its interactions highlighted; (right) structured around a user-constructed model. Different classes of proteins have different appearances: experimental proteins are colored yellow, kinases are drawn as hexagons and receptors as irregular stars; several experimental proteins are not known to be connected to the pathway and are therefore located in the lower right corner. HPRD proteins are placed in a structured manner between the pathway proteins based on their separation from the pathway proteins. (cutout) Disadvantage of simply drawing the network on top of the pathway image: HPRD interactions obscure elements of the canonical pathway; compare to the improved method (left) in which important pathway elements remain in the foreground.

3.4 Exploration plane versus zoom-and-pan. (left) The network is explored in a separate plane showing only one protein and its interactors. Selecting an interactor changes the view to that particular protein via a smooth animation. This interaction network crawling method allows systematic discovery of connections among proteomic data and existing protein knowledge. Transparency keeps the global view visible and the same protein is highlighted within both planes. The protein layout in the exploration plane mimics the layout in the global plane, but is slightly distorted to achieve a more attractive representation. Changes in peptide abundance are represented as linear heatmaps. (right) Zooming and panning, while also available to explore the network, have several drawbacks: the view is cluttered, some interactors reach outside the viewing area, there is no space for additional details, and the global perspective is lost.
4.1 Five examples of digital map visualization (from left to right): gene co-expression and heatmap representations, a genome-viewer, a protein interaction network and a brain tractography projection.

4.2 Co-expression map of 23k genes over 24 cell types of the B-cell family exemplifies the map concept. The top view illustrates how maps are combined with client-side graphics: the map is at the center of the display, while selecting genes by drawing an enclosing rectangle generates a heatmap on the right. Maps have multiple levels of zooming (bottom 2 rows), each with a potentially different representation. For example, genes are drawn as heatmap glyphs at the high zoom (lower right), and as dots at low zoom. Expression profiles of collocated genes are aggregated and displayed as yellow glyphs over the map. As zoom increases, expression profiles are computed for increasingly smaller regions. Interactions are not limited to zooming and panning; pop-up boxes link out to extra data sources, and selections of genes bring up a heatmap (top panel).

4.3 A heatmap representation is displayed as a map, with gene and cell type axes implemented in Protovis attached on the right and at the bottom. The axes are linked to the map's zooming and panning so that users can identify which genes and cells they are looking at. Selection of an area of interest prompts the highlighting of the corresponding cell types and genes.

4.4 Gene expression data measurements over eight cell types of the entire mouse genome are mapped onto genome coordinates. The top view shows the general analysis framework as presented on the Immgen website; zoomed-in views appear at the bottom. Three types of visual queries can be performed, depending on the zoom. At an overview, lists of relevant genes can be highlighted using Google markers with custom icons: white lines with alpha gradients on each side marking regions with interesting expression characteristics. At an intermediate zoom (lower left), regions with similar expression can be identified: a blue low-expression region is visible at center right. At a zoomed-in level, individual expression values and gene names can be identified.

4.5 Analysis of quantitative proteomic data in the context of a protein interaction network. The top panel shows an overview of the analysis setup. Time-course proteomic data is displayed on the lower left. The experimental protein selected in the list is highlighted on the map. A second protein was selected on the map and has its interactors and meta-information displayed. All instances of this protein are listed on the upper left, together with their interactors. Three additional zoom levels are shown on the lower row; as zoom level increases, less relevant proteins are added to the display.

4.6 DTI tractography data projected onto the sagittal, coronal and transverse planes. Major tract bundles are represented schematically by their centroid tract; individual tracts in bundles are linked from the centroid bundle to their projected end points. Zooming in allows access to smaller clusters of tracts. Bundles can be selected and pre-computed statistical data along with 3D views of the tract bundle ("brain view") can be displayed.

4.7 Linked co-regulation maps of the T-cell (left) and B-cell (right) families.
A selection in the T-cell map is reflected onto the B-cell map. A few groups of genes that are co-regulated in both cell families are noticeable by inspecting the upper part of the B-cell map.

5.1 By making subtle, non-functional changes in the interface of an analysis support module (top) we generated statistically significant changes in users' analytic behavior in a visual problem-solving task. A first set of changes nudged subjects to increase their use of the analysis module by 39% (lower left, p = 0.02) in an attempt to support our subjects' working memory. It also caused them to switch among hypotheses 19% more often (lower center, p = 0.03), indicating more consideration of alternative hypotheses. A second set of changes then led subjects to gather 26% more evidence per hypothesis (lower right, p = 0.01). These three increases compare to smaller or negative variations in a control group (+15%, −17%, −2%).

5.2 The two modified analysis interfaces include three evaluated nudges: a box listing online users actively interacting with the analysis module (left), a color gradient (white to gold) showing recently analyzed hypotheses (left), and a redesigned, larger evidence box asking users to commit to the implications of a hypothesis not having associated evidence (right).

5.3 Changes between the first two sessions (black) caused test subjects (square) to increase the number of hypotheses and evidence items entered into the analysis system by an additional 24% over the control subjects' (triangle) relative increase. The interface changes made before the third session did not have a significant impact on this performance measure (grey).

5.4 Changes between the first two sessions caused test subjects (square) to increase their switching between hypotheses by an additional 35% over the control subjects' (triangle) relative increase.

5.5 Changes between the last two sessions (black) caused test subjects (square) to gather 24% more evidence for their hypotheses, as opposed to a constant evidence/hypotheses ratio (−2%) between all consecutive control sessions (triangle). Changes in the test group before the second session (gray) produced non-significant changes in evidence collection as compared to the control group.

Chapter 1

Introduction

1.1 Problem Statement

The aim of this dissertation is to further visualization research by introducing novel techniques that let researchers hypothesize about their scientific data faster and more accurately than before. First, we use domain driven design to create data visualizations that enable new, or accelerate existing, data analysis workflows in three specific domains: neuroscience, proteomics, and genomics. This allows researchers to extract insights from their data more efficiently. Second, we want to support the process of distilling these insights into actionable hypotheses by quantifying the extent to which interface elements can be used to align scientists' analysis strategies with verified problem-solving techniques.
Data visualization explores how to represent complex data graphically so as to maximize the ability of the human visual system and intuition to extract meaning from it. Visualization is therefore by definition tied to data, and a visualization's value is often judged by its ability to improve the scientific workflows of researchers from different domains. This is why visualization perfectly embodies Fred Brooks' perspective on computer scientists as toolsmiths: "...hitching our research to someone else's driving problems, and solving those problems on the owners' terms, leads us to richer computer science research" [27]. In accordance with this vision, an important part of this dissertation is dedicated to introducing and evaluating novel visual representation and interaction techniques that help scientists understand their data better in a few concrete application areas: neuroscience, proteomics, and genomics.

However, visualization is a research area in its own right. Novel approaches that support the core of the scientific analysis process and are generalizable across multiple domains have broad, long-lasting impact and define visualization as a stand-alone research area. As such, this dissertation combines visualization contributions that are specific to the above-mentioned domains with a general approach that allows system designers to guide scientists, regardless of their domain and the visualization they use, towards improved analytic strategies. The three domain-specific contributions and the final general one correspond to four goals that in concert implement the dissertation's aim of enabling researchers to more efficiently derive and test hypotheses from their scientific data.

The first three goals are approached following the traditional visualization method. We start by interviewing domain experts and observing their workflows to identify shortcomings in tools they commonly use to analyze their particular scientific data. We then introduce novel visualization methods to alleviate these shortcomings. Finally, we demonstrate, through formal or informal evaluations with our collaborators, that the developed methods let researchers understand the data faster and with less effort than before.

Using this methodology we first allow neuroscientists to more efficiently interact with and understand diffusion tensor imaging (DTI) datasets. We create abstract planar representations of DTI datasets and link them to traditional 3D models of the same data. We show, both quantitatively and qualitatively, that this accelerates interaction tasks that are at the heart of white matter analysis.

Second, we enable proteomic researchers to relate their experimental data to the existing body of knowledge with the hope of accelerating the process of discovery. To this end, experimental data is visually collated with existing protein interaction information in ways that speak to proteomic researchers' intuition. Anecdotal feedback indicated that our methods significantly accelerated previous workflows for analyzing proteomic experimental data.

Third, we increase scientists' access to visual representations of data and eliminate the overhead of installing and learning new applications and of creating visualizations. Specifically, we present a novel way of disseminating data as large pre-rendered visualizations distributed via the Google Maps API. We evaluate this approach on genomic data but show how it can be generalized to other domains.
Anecdotal feedback showed that this simplified mode of data access proves particularly useful for exploring new datasets, in casual settings, and for users who are not computer savvy.

Finally, we aim for an improvement of the scientific analysis process, independent of particular scientific fields and visualizations, by quantifying the degree to which interface design choices can lead to improvements in the analytic strategies used by researchers. This last goal extends the visual analytics research agenda which, among other things, advocates for the improvement of the reasoning process itself. Inspired by persuasive technology [64] and Sunstein and Thaler's "nudge" concept [159], which state that a system's design can unobtrusively guide a user towards behavioral patterns that are more efficient, in terms of self-assumed goals, we show how we can use subtle changes in the interfaces of visual analysis systems to unobtrusively guide users towards analytic strategies that cognitive science deems more efficient.

The dissertation exemplifies how visualization and interface design support scientific discovery workflows from raw data interpretation to hypothesis elicitation and testing. The four goals that compose this dissertation illustrate the generality continuum spanned while supporting scientific data analysis workflows in their entirety. Discovering efficient data representations and interactions is often data and domain specific. Our first two goals (Chapters 2, 3) demonstrate how a tight collaboration with domain experts can lead to novel visual representations and interactions and reveal shortcomings of and improvements to existing ones. Making these representations available to scientists (e.g., as stand-alone systems or web applications) represents an opportunity for research that is less domain dependent (Chapter 4). Finally, hypothesis elicitation and testing, while sometimes influenced by domain specific particularities and constraints, can often be thought of in abstract terms using cognitive science and problem solving principles (Chapter 5). In concert, the four contribution areas presented in this dissertation support a significant part of the scientific discovery process.

1.2 Overview and Contributions

The broad contributions of this dissertation are novel visualization methods that improve analysis in proteomics, genomics and neuroscience, and a better understanding of how we can leverage user interface design to improve scientific data analysis. This section provides a short introduction to these four application areas and a break-down of specific contributions.

1.2.1 Visualization Contributions in Neuroscience

White matter connects gray matter regions in the brain and is composed of bundles of myelinated axons. White matter research has important clinical applications as changes in white matter are associated with a wide range of neurological diseases. Diffusion Tensor Magnetic Resonance Imaging (DTI) enables the in vivo reconstruction of white matter as collections of 3D integral curves that reflect the paths of the white matter bundles. Due to the intricacy of the connectivity in the brain, such 3D models are visually dense, making it difficult for practitioners to identify and interact with anatomical and functional structures. To address this, starting from the inherently three dimensional data, we create simplified planar representations that summarize important properties of the data but are more suited for interaction, given their two-dimensionality.
We then link these representations to traditional 3D curve models such that all visualizations can be used in concert for a better understanding and manipulation of the data. We demonstrate in a small quantitative user study that this synergy accelerates interaction with white matter datasets. This work was published in [85, 88, 90, 70, 87, 89].

Contributions:

• Faster interaction with and better understanding of 3D models of white matter structures in the brain by linking them to two-dimensional abstractions defined from the same data

• A novel two-dimensional representation of white matter data that has the desirable properties of low-dimensional representations (e.g., visual clarity or ease of selection) while preserving anatomically meaningful coordinates

• An evaluation that quantifies the benefits of augmenting a 3D white-matter model of DTI data with linked planar representations

• Improved accessibility to white-matter visualizations for browsing purposes by disseminating them using the Google Maps API

• A concrete visualization system for analyzing white-matter datasets

1.2.2 Visualization Contributions in Proteomics

Proteins within a cell interact to regulate the cell's activity. Cascades of protein interactions peculiar to specific cells or cellular outcomes are called signaling pathways. An in-depth understanding of these pathways will let researchers discover efficient drugs that influence a cell's behavior without causing unwanted side-effects.

Experimental data is an important component in understanding how signaling pathways function. However, to efficiently interpret experimental results, they need to be collated with existing knowledge that can explain experimental observations and provide additional insight. Among the most common such data types are protein-protein interactions stored in public databases.

We present visualization techniques and design guidelines for combining interaction and experimental data in ways that harness proteomic researchers' intuitions. These include: integrating protein interaction networks into familiar signaling pathway images instead of drawing them as general graphs; enabling interaction-level analysis of dense networks; and driving exploration by comparative analysis of multiple experimental datasets. An evaluation with domain experts motivates and demonstrates the utility of this work. The results were published in [86, 122, 95, 180].

Contributions:

• Design guidelines for visualizing protein interaction networks and experimental proteomic data together

• An anecdotal evaluation with domain experts revealing proteomic analysis workflows

• A method for augmenting static pathway diagrams with dynamic interaction information

1.2.3 Visualization Contributions in Genomics

Scientists today have access to many large datasets that describe biological processes. Advanced systems for visualizing such data exist but have associated costs that depend on a scientist's computer abilities and familiarity with the data type and content. Thus, when handed unfamiliar datasets, researchers often assess the time commitment these require and determine whether the analysis costs are justified. This may lead to wasted analysis time or to potentially useful data being ignored.

In this context we explore the benefits of using the Google Maps API, a pan-and-zoom interface that is well supported and highly familiar, to distribute raw data along with pre-rendered visualizations derived from it.
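To make this concrete, the sketch below (in Python) illustrates the offline pre-rendering step that such a distribution pipeline relies on: a single large rendering of the visualization is sliced into the fixed-size tile pyramid that the Google Maps API serves, where zoom level z consists of a 2^z x 2^z grid of 256-pixel tiles. This is a minimal sketch, not the actual implementation described in Chapter 4; the file naming scheme and the use of one source image resized per zoom level are illustrative assumptions.

    import os
    from PIL import Image  # Pillow

    TILE = 256  # the Google Maps API serves 256 x 256 pixel tiles

    def cut_tiles(image_path, out_dir, max_zoom):
        """Slice one large pre-rendered visualization into a tile pyramid.

        Zoom level z is a 2^z x 2^z grid of tiles, so the rendering is
        resized to (2^z * TILE) pixels per side before cutting. The naming
        scheme tile_{z}_{x}_{y}.png is a hypothetical convention.
        """
        Image.MAX_IMAGE_PIXELS = None        # permit very large renderings
        full = Image.open(image_path)
        os.makedirs(out_dir, exist_ok=True)
        for z in range(max_zoom + 1):
            n = 2 ** z                       # tiles per side at this zoom
            level = full.resize((n * TILE, n * TILE), Image.LANCZOS)
            for x in range(n):
                for y in range(n):
                    box = (x * TILE, y * TILE, (x + 1) * TILE, (y + 1) * TILE)
                    level.crop(box).save(f"{out_dir}/tile_{z}_{x}_{y}.png")

    # cut_tiles("coexpression_map.png", "tiles", max_zoom=6)  # 7 zoom levels

On the client, a custom map type then requests tiles by their (zoom, x, y) coordinates. As Table 4.1 indicates, storing only non-empty tiles and choosing the image format (PNG vs. JPG) per visualization keeps the storage footprint manageable.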
We use five concrete visualization examples to show that integrating Google Maps with established visualization techniques offers a low-overhead way of exploring a dataset to assess its relevance and facilitates lightweight analyses of datasets outside a researcher's immediate focus. A collaborative design process and evaluation revealed that the pre-rendered browser approach works well in the genomic domain. We hypothesize that this distribution model may be extended to other application areas as well. This work was published in [91, 92].

Contributions:

• The concept of disseminating genomic data as precomputed visualizations using the Google Maps API

• An anecdotal evaluation showing the advantages and disadvantages of this approach

• Five examples of specific visualizations and their evaluation with domain specialists

• Design elements, challenges and opportunities when working with pre-computed visualizations and the Google Maps API

1.2.4 Improving Scientists' Analytic Strategies through User Interface Changes

Ample cognitive science research has revealed that human thinking is subject to heuristics and biases that often lead to suboptimal problem solving [76]. A proposed solution in the fields of behavioral economics and human-computer interaction relies on designing choice-layouts (i.e., how choices are presented to consumers) and computer interfaces that "nudge" users towards decisions in their own interest.

We extend this "nudge" concept into the visualization domain by providing experimental support for the hypothesis that subtle changes in the user interfaces of visual analysis systems can unobtrusively steer researchers away from cognitive biases and heuristics. Specifically, we report results from a controlled study in which subjects were asked to complete three analysis sessions using a system consisting of a visualization and an analysis support module. Two sets of non-functional changes were made to the analysis support interface before the second and third sessions. These changes were designed to improve three hypothesized or observed analytic deficiencies. Results of the study show that the changes succeeded in alleviating the targeted deficiencies. This work was published in [94, 93].

Contributions:

• An evaluation that quantifies the effect of design variation on analytic performance

• Qualitative observations about users' analytic strategies in a network analysis task

1.3 Background and Motivation

This section motivates the work presented in this dissertation by relating it to previous approaches and findings in the fields of visualization, visual analytics and cognitive science. Subsequent chapters will describe related work pertaining to each specific topic in more detail.

Visualization leverages the human visual system to enhance our ability to process large amounts of data and to facilitate cognition. Visual representations of information allow analysts to perceive patterns in the data, see data in context, and draw comparisons. In their anthology "Using Vision To Think", Card, Mackinlay, and Shneiderman [32] describe how visualization supports the process of sensemaking, in which information is collected, organized, and analyzed to form new knowledge and drive analysis. Visualization draws from the interplay between graphically represented data and a human's perception and analytical abilities, and is as such intrinsically tied to the users it aims to help.
In accordance with this view, an important part of this dissertation is dedicated to introducing novel visualization techniques and refining existing ones to help scientists in a few concrete application areas interact with their data better than before. We found that feedback on existing visualization methods and analysis workflows from domain experts in neuroscience, proteomics and genomics revealed important shortcomings that we could address. Specifically, we discovered that the traditional mode of visualizing and interacting with data representing white matter in the brain [19, 118] can be improved by providing alternative, abstract views of the same data alongside the original representation. We found that simply drawing protein interaction networks as general graphs, as was done previously [82, 148], does not correspond to how proteomicists think of protein pathways. Finally, we found that when it comes to making visual representations accessible to end-users, feature-rich and highly adaptive environments [113, 102, 152] are not necessarily optimal.

As visualization is applied to increasingly complex problems and data, simply immersing the scientist in a visualization with the hope of discovering the unexpected becomes unfeasible. The primary product of visualization tools is insight [155, 150], defined by [141] as "an individual observation about the data by the user, a unit of discovery". Insights can be used as evidence or in the best cases as hypothesis seeds, but they are rarely full-fledged, testable hypotheses. Visualization can thus be regarded as a support mechanism for generating evidence. This restrictive view on the role of visualization implies that users are left to their own devices when aggregating scattered pieces of information into high-level hypotheses.

This should be regarded as a limitation because significant evidence from cognitive science suggests that human thinking is subject to heuristics and biases that deviate from normative rationality and lead to erroneous analysis. Such effects occur in all stages of analysis and many have not only been demonstrated, but also quantified by controlled psychological studies: Wason's 2-4-6 study [174] shows that hypothesis confirmation is used instead of the normatively correct hypothesis disconfirmation; Simon [151] reveals "satisficing", a heuristic that limits analysis to a hypothesis that is good enough. Dunbar [54] shows that such effects hold in scientific analysis settings as well.

Fortunately, evidence also suggests that reasoning can be improved by following rational recipes for analysis and external aids that amplify cognition. For example, Dunbar [54] shows how the bias toward seeking confirming evidence can be overcome, and describes how analogy and unexpected findings often lead to consideration of multiple hypotheses in scientific domains [55]; Savikhin et al. [144] use a specific example from economic reasoning to prove that visualization can help overcome error-prone heuristics used in decision making.

In the field of visualization these aspects have been recognized and investigated by visual analytics, a sub-field fueled by growing intelligence needs after 2001. Illuminating the Path [165] introduced and defined this emerging field as "the science of analytical reasoning facilitated by interactive visual interfaces". Since its introduction visual analytics has advanced science with both theoretical and applicative results.
Examples of the former include a five-stage sense-making model [140, 23] derived through Cognitive Task Analysis (CTA) or valuable insight into the workflows of collaborative sense making [138, 84]. Representative of the latter are a plethora of applications that probe the feature and design space of analysis-support software, such as The Analyst's Notebook [124], The Scalable Reasoning System (SRS) [132] or Entity Workspace [22]. Most of these systems offer a large set of evidence and hypothesis management features, which is likely to increase the cognitive span of their users and allow them to make associations that they couldn't make before.

The work presented in this dissertation complements such research by using a visual analytics methodology to create a link between observed analytic deficiencies and corrected behavior. Although cognitive biases and the need to leverage cognitive science expertise to alleviate them had been recognized within the field [73, 158], few visual analytics attempts have been made to bridge the gap between descriptive analysis (i.e., how humans actually analyze problems) and normative analysis (i.e., rational strategies of analysis). Here, we test the hypothesis that careful design of features included in a visualization system can unobtrusively guide users towards normatively correct analysis. This approach was inspired by Thaler and Sunstein's work on libertarian paternalism [164] and the idea of "choice architecture". The authors rely on the assumption that anyone who designs how choices are presented is necessarily influencing decision-making behavior, and advocate for designing choice structures that "nudge" users to make decisions in their best interests. We used these concepts to demonstrate that interface design can be leveraged in a targeted way to guide scientists towards using an analysis system more, pursuing multiple hypotheses in parallel, and gathering more evidence per hypothesis.

1.4 Road Map

The document is broadly structured as follows. Chapters 2-4 share a similar structure and describe visualization contributions in three specific areas: neuroscience, proteomics and genomics. Chapter 5 describes the results of a user study which tests whether interface nudges can be leveraged to guide users towards rational analysis strategies. The dissertation ends with a concluding chapter. Below is a detailed description of this structure.

Chapters 2-4 describe visualization contributions made in neuroscience, proteomics and genomics and share a similar structure. Each chapter starts with an introduction to the problem, motivation of its significance and an overview of contributions. A detailed related work section that contrasts the methods presented in this dissertation to previous approaches follows. Next are design choices and methods. Then we describe the evaluation procedure and the results obtained. Each chapter ends with a discussion and conclusion.

Chapter 2 demonstrates that linking 3D stream-tube models of DTI datasets to planar abstractions derived from the same data can accelerate interaction with and exploration of the datasets. It also describes novel and effective planar representations, and an application that can be used for DTI analysis. Chapter 3 introduces methods for collating publicly available protein interaction data with experimental results in ways that support the workflows and intuitions of proteomic researchers.
Chapter 4 shows how scientific data can be disseminated as a variety of pre-computed visualizations served through Google Maps and presents anecdotal feedback from domain experts that demonstrates the usefulness of this approach. Chapter 5 describes how software interface elements can be used to correct analytic behavior that is affected by cognitive biases and heuristics. We first motivate our approach by citing ample cognitive science research and relating it to existing efforts in visualization and visual analytics. We then describe a user study that tests our hypothesis and present its results. We end with a discussion and conclusion. Chapter 6 concludes this dissertation with a reiteration of contributions, a statement on the impact of the presented work, a few discussion points pertaining to the dissertation as a whole, a description of potential future directions, and a short summary.

Chapter 2

Planar Exploration and Analysis of 3D White Matter Tractograms

Diffusion Tensor Magnetic Resonance Imaging (DTI) enables the exploration of fibrous tissues such as brain white matter and muscles non-invasively in vivo [18]. It exploits the fact that water in these tissues diffuses at faster rates along fibers than orthogonal to them. Integral curves that estimate fiber tracts by showing paths of fastest diffusion are among the most common information derived from DTI volumes. Such curves are generated from DTI data by following the principal eigenvector of the underlying diffusion tensor field bi-directionally and are commonly referred to as fiber tracts. Sets of DTI tracts are known as tractograms and their study is called tractography.

In this dissertation we discuss DTI visualization in the context of white matter tractography. White matter in the brain ensures the connectivity between various regions of gray matter and is composed of bundles of myelinated axons. White matter tractography has important applications in both clinical and basic neuroscience research, as lesions in white matter are associated with a wide range of neurological diseases.

DTI curves are often visualized as 3D models composed of streamlines or variations of streamlines (streamtubes and hyperstreamlines) in 3D [117, 182]. Reflecting the intricacy of the connectivity in the brain, these 3D models are generally visually dense and, with increasing DWI resolutions, this complexity is bound to become greater. It is thus often difficult for practitioners to see tract projections clearly or identify anatomical and functional structures easily in these dense curve collections. Typical interaction tasks over tracts, such as fine bundle selection, are also difficult to perform and have been a focus of recent research [8, 7].

In this context, we present a novel interaction paradigm and demonstrate, both qualitatively and quantitatively, that it can accelerate the exploration of white matter tractograms. Starting from the inherently three dimensional data, we create abstract planar representations. These representations summarize important properties of the data and are suitable for interaction, given their two-dimensionality. We link such abstract representations to traditional 3D tractogram models through interaction, such that operations performed in one of the views are mirrored into the others. Users can thus create a mental mapping between the different modes of representation and use them in concert for a better understanding and manipulation of their data.
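Before introducing these planar abstractions, the following sketch illustrates the streamline tracing described above: Euler integration of the principal eigenvector of the diffusion tensor field, traced bidirectionally from a seed point. It is a minimal illustration in Python under simplifying assumptions (nearest-neighbor tensor lookup, a fixed step size, and a fractional anisotropy threshold as the stopping criterion), not the exact procedure used to generate the tractograms analyzed in this chapter.

    import numpy as np

    def principal_direction(tensor):
        """Unit eigenvector of the largest eigenvalue of a 3x3 tensor."""
        w, v = np.linalg.eigh(tensor)        # eigenvalues in ascending order
        return v[:, -1], w

    def fractional_anisotropy(w):
        """Standard FA measure computed from the three eigenvalues."""
        denom = np.sqrt((w ** 2).sum())
        if denom == 0.0:
            return 0.0
        return np.sqrt(1.5 * ((w - w.mean()) ** 2).sum()) / denom

    def trace_fiber(field, seed, step=0.5, fa_min=0.2, max_steps=2000):
        """Trace one fiber tract through a (X, Y, Z, 3, 3) tensor volume."""
        halves = []
        for sign in (+1.0, -1.0):            # integrate in both directions
            pos, prev = np.asarray(seed, float), None
            pts = [pos.copy()]
            for _ in range(max_steps):
                idx = tuple(np.round(pos).astype(int))
                if any(i < 0 or i >= s for i, s in zip(idx, field.shape[:3])):
                    break                    # stepped outside the volume
                d, w = principal_direction(field[idx])
                if fractional_anisotropy(w) < fa_min:
                    break                    # isotropic region: stop tracking
                d = sign * d if prev is None else (-d if d @ prev < 0 else d)
                pos = pos + step * d         # Euler step along the fiber
                prev = d
                pts.append(pos.copy())
            halves.append(pts)
        # join the two half-tracts into a single poly-curve through the seed
        return np.array(halves[1][::-1] + halves[0][1:])

Production tractography pipelines typically add tensor interpolation, curvature-based stopping criteria, and higher-order integrators such as Runge-Kutta; the sketch only conveys the principle.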
In a first developmental iteration, given a tractogram, we linked conventional 3D white matter tractograms to a planar embedding and a hierarchical clustering tree (see Figure 2.1). Both 2D visualizations are representations of a similarity matrix obtained by computing pairwise "distances" which reflect the geometrical similarities between fiber tracts. The planar embedding is obtained by considering each fiber tract to be an individual 2D point and placing it on a drawing canvas such that the distances between points approximately reflect the distance relations between their corresponding fiber tracts. The hierarchical clustering tree representation, or dendrogram, is obtained by applying the average linkage hierarchical clustering algorithm on the similarity matrix. As shown in Figure 2.1, these two abstract representations can be linked to a traditional 3D tractogram model implicitly through interaction and explicitly through a perceptually uniform coloring.

This first iteration was evaluated by interviewing experts and gathering feedback in an informal setting. Results suggested that this type of coordinated interaction has the potential to enable faster and more accurate manipulation of dense fiber tract collections. Work done concurrently on low dimensional brain representations and described in [42] confirms our findings. The authors there quantitatively compare linked views of low dimensional representations of DTI data and traditional 3D models to several state-of-the-art DTI visualization systems.

The main drawback of such abstract representations, as found in our first evaluation session, is that they lack an explicit anatomical interpretation. This means that little or no spatial correlation can be found between the abstract views and the anatomical views. It is therefore challenging for practitioners to create lasting mappings between abstract representations and their corresponding 3D tractograms, even in the presence of non-spatial links between the views (i.e., interaction and color).

Motivated by this problem, we introduce two-dimensional neural maps which package desirable properties of low-dimensional representations into views that preserve meaningful anatomical coordinates. Starting from a hierarchical clustering of a white matter tractogram and a given clustering cut, bundles of tracts and their corresponding centroids are computed. These centroids are then projected along the three principal projection planes: sagittal, axial and coronal. The result is a set of projected neural paths in the plane, similar to illustrations in medical textbooks. We link these two-dimensional path projections to the original 3D white matter model as shown in the interactive system illustrated in Figure 2.2.

We assess the usefulness of neural path projections in two consecutive studies, one anecdotal and one quantitative. Anecdotal study results indicate that this new representation is intuitive and easy to use and learn. Results of the quantitative study show that users are faster and more confident with the neural path projections than with traditional 3D interaction or with linked abstract planar representations.

Figure 2.1: Coordinated DTI tractogram model exploration in lower dimensional visualizations: 2D embedding (upper-right), hierarchical clustering (lower-left), and L*a*b* color embedder (lower-right). A selection of a fiber-bundle (red) in the hierarchical clustering is mirrored in the other views.
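The first-iteration pipeline just described (pairwise tract distances, a spring-based planar embedding, and an average-linkage dendrogram) can be sketched compactly. The Python code below is a simplified illustration assuming tracts are given as NumPy point arrays; it uses the mean-of-closest-distances measure reviewed in Section 2.1.1 and a naive force-directed placement, not the tuned implementations evaluated in this chapter.

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    def mean_closest_distance(a, b):
        """Mean of closest distances between two tracts, each an (n, 3) array."""
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
        return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

    def distance_matrix(tracts):
        n = len(tracts)
        dm = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                dm[i, j] = dm[j, i] = mean_closest_distance(tracts[i], tracts[j])
        return dm

    def spring_embed(dm, iters=500, lr=0.05, seed=0):
        """Naive force-directed placement: repeatedly move every 2D point so
        that inter-point distances approach the target tract distances in dm
        (O(n^2) work per iteration, as discussed in Section 2.1.2)."""
        rng = np.random.default_rng(seed)
        p = rng.standard_normal((len(dm), 2))
        for _ in range(iters):
            diff = p[:, None, :] - p[None, :, :]
            cur = np.linalg.norm(diff, axis=2) + 1e-9
            # Hooke-style spring force proportional to the residual distance
            force = ((cur - dm) / cur)[:, :, None] * diff
            p -= lr * force.mean(axis=1)
        return p

    # dm = distance_matrix(tracts)       # tracts: list of (n_i, 3) arrays
    # xy = spring_embed(dm)              # 2D point embedding of the tracts
    # tree = linkage(squareform(dm), method="average")  # dendrogram input

Section 2.2 describes the refinements that make such embeddings practical for full brain models, including decay-weighted spring forces, repulsive forces, and relative rather than absolute distance displacement (compare Figure 2.3).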
2.1 Related Work

Here we discuss existing techniques that the present work builds on: techniques for visualizing and interacting with DTI datasets, methods for visualizing similarity relations, and the use of multiple, coordinated views for visualization.

2.1.1 Visualizing and Interacting with DTI Datasets

The most commonly used technique to visualize DTI data is streamline tracing; in DTI-specific literature this is also called fiber tracking [117] or tractography [19]. This method is used in our 3D DTI visualization.

Interacting with streamline DTI models is not trivial. A common interaction task is the selection of fiber bundles. This is usually done directly on the model by placing 3D regions of interest (ROIs) along the presumed path of the desired bundle and then having the application select fibers that intersect those ROIs [33, 172, 112]. More recently, Akers et al. [7] introduced a sketching and gesture interface for pathway selection: the user paints a 2D freehand stroke and the selection algorithm selects tracts that cross the brush path. Finally, concurrent research by Chen et al. [37] also links 2D embeddings to DTI datasets and finds that this accelerates interaction. The work presented in this dissertation differs by incorporating hierarchical clustering trees, using perceptually variable coloring to link views, and, most importantly, introducing neural path projections as a novel type of low dimensional representation that maintains an anatomical framework.

Figure 2.2: An interactive analysis system using linked views and planar tract-bundle projections. Three planar representations, along the coronal, transverse and sagittal planes (bottom panels), are linked to a 3D stream-tube model (upper left) and a 2D point embedding of tract similarities (upper right). Selections in the projection views can be performed by clicking or cutting across cluster curves and are mirrored in the 3D view. Points corresponding to the selected tracts are interactively embedded into the plane and used to refine selections at tract level.

Automatic DTI fiber clustering methods have been developed to support DTI model interaction and visualization. For a review of such methods consult [116]. Fiber clustering relies on a similarity metric that captures the geometric similarity between integral curves. For example, closest point measures like the Hausdorff distance [41] and the Fréchet distance [9] only measure the Euclidean distance between two selected points on a pair of curves. Conversely, the average point-by-point distance between corresponding segments defined in [51], the mean of closest distances defined in [41], and the mean of thresholded closest distances defined in [182] summarize all points along two curves as the mean Euclidean distance along their arc lengths. Fiber similarity can be mapped to color, as was first done in [28] by assigning distinct colors to clusters, and more recently in [48] by immersing a 3D embedding into the L*a*b* color space.

2.1.2 Visualizing Similarity

Visualization literature describes several methods for conveying similarity relationships between entities. Most of them have been researched in the context of multidimensional visualization, where the distance is derived from the position of a point along each dimension. However, a subset of these methods can be used for entities over which an arbitrary similarity function is specified. In the following, we will only review this category.
2.1.2 Visualizing Similarity

Visualization literature describes several methods for conveying similarity relationships between entities. Most of them have been researched in the context of multidimensional visualization, where the distance is derived from the position of a point along each dimension. However, a subset of these methods can be used for entities over which an arbitrary similarity function is specified. In the following, we review only this category; for a more detailed discussion of multi-dimensional visualization techniques, Keim [100] provides a good overview.

An intuitive way of making distance apparent is a scatterplot. In its simplest form this method can only be used for data with at most three dimensions and explicit vector values. Multi-dimensional scaling (MDS) techniques can overcome this limitation. They attempt to map the multi-dimensional points to a visualizable lower dimension while preserving distance relations between points. So-called non-linear MDS methods are suited for computing representations when distances between points are given explicitly but coordinate values for the points are unknown, as is the case for the tract similarities computed as part of this research. These methods use the distance between data points to define an error measure that quantifies the amount of distance information lost during the embedding. Gradient descent or force simulation is then used to arrange the points in the low dimensional space so as to minimize the error measure. A good example of such an approach is Force Directed Placement (FDP) [68], originally proposed by Eades [57] as a graph drawing approach. It simulates a system of masses connected by springs of lengths equal to the distances that need to be embedded. The points are initially placed at random and are then iteratively moved by displacements derived from forces computed with Hooke's spring law. After a number of iterations the spring system reaches a local minimum energy state that represents the resulting embedding. We use this method as part of our work.

An iteration of the original FDP model is O(n^2), and since at least n iterations are necessary to reach equilibrium, the final complexity is O(n^3). This makes the computation for high-resolution, complete brain models expensive. One method that addresses this problem is the Force Scheme proposed by Tejada et al. [161]; it reduces the overall complexity to O(n^2) by requiring fewer iterations to reach the final state. A complexity of O(n^(5/4)) was achieved by Morrison et al. [119] by creating a hybrid model based on approximations using samples and interpolations. In this dissertation we use another algorithm, with linear iteration time, developed by Chalmers [34].

An MDS embedding can be used in conjunction with a perceptually uniform color space to display similarity as a color cue. We use this technique to reflect the variation of tract similarity as a perceptual variation of colors: similar tracts receive perceptually similar colors while dissimilar tracts get perceptually distant colors. A color space is said to be perceptually uniform if the perceptual difference between any two colors in just noticeable difference (JND) units is equal to the Euclidean distance between the two colors in that color space. The L*a*b* color space is perceptually uniform, and thus a 2D or 3D embedding can be immersed into L*a*b* to obtain a similarity color coding. It should be noted, however, that the perceptual uniformity of L*a*b* is an empirical approximation and assumes a particular calibration setting for individual monitors.

A dendrogram is another method for visualizing similarity that does not require explicit vector values for points and as such is suited for displaying tract similarity. It is a tree-like visual representation of results produced by hierarchical agglomerative clustering algorithms ([15, 99]).
Because dendrograms are used in a wide range of scientific domains, they have become intuitive tools for many scientists.

2.1.3 Coordinated Views for Visualization

Visualization techniques are usually task and data specific. Different views are therefore frequently used to show data from multiple perspectives, combine the strengths of individual techniques, and distribute the cognitive and interpretative load of complicated data and tasks across multiple views [16]. However, the task of aggregating the different views into a single, unified mental image adds to the complexity of the visualization itself [16]. This effect can be reduced by coordinating the content, appearance and behavior of the views [123]. This is achieved either implicitly, through coordinated appearance or behavior, or explicitly, through visual cues such as color or lines linking the separate windows. In this dissertation we use both approaches. Shneiderman [149] offers a good review of multiple-view coordination techniques such as brushing and linking or details on demand.

Multiple-view applications are often used to aid in the understanding and exploration of complicated datasets. In [30], the authors show several examples of how brushing and linking techniques can be used to map a complicated data space into multiple simple views that, when explored together, convey the overall data picture. Gresh et al. [74] present an approach that links 3D visualizations to statistical representations to facilitate the effective exploration of medical data. XmdvTool [173] and Visulab [145] attempt to maximize a user's understanding of multidimensional data by linking multiple representational techniques such as scatterplots, glyphs, or parallel coordinates. Finally, work such as [129] and [26] proposes domain-independent, extensible multiple-view architectures that satisfy general requirements of the visualization domain.

2.2 Design Elements

Here we describe the methods underlying the DTI interactions described in this chapter. To reiterate, these interactions build on the concept of linking traditional 3D visualizations of DTI datasets to abstract planar representations derived from the same data.

We first introduce the 3D stream-tube visualization, which is used throughout the work and incorporates established techniques and interactions. We continue with several types of planar abstractions that, when linked to traditional 3D tractograms, improve interaction and data understanding. All such abstractions are based on a geometrical similarity measure between 3D tracts, which is discussed first. The different types of abstractions are divided into two categories, discussed in separate sections that follow. First, explicit visualizations of tract similarity are based on well established techniques but lack anatomical meaning. Second, a novel type of visual abstraction combines the strengths of low dimensional representation with a meaningful anatomical relation to the original 3D data. Finally, we describe a multiple-views system that makes this novel interaction paradigm available to neuroscientists.

2.2.1 A 3D Stream-Tube Visualization of DTI Data

Datasets used as part of this dissertation were 3D white matter tractograms stored as text-formatted 3D poly-curves. Details on how raw DTI datasets were acquired and how tractograms were derived from them can be found in [85]. We display the tracts as 3D stream-tubes (see Figure 2.1). The following interaction modes are available on the 3D model.
The 3D sphere selection tool selects any tubes passing through a sphere that the user can position with the mouse in the XY plane and with the mouse-wheel along the depth dimension. The size of the sphere is adjustable. Alternatively, the 2D brush tool allows a user to draw a freehand brush stroke over the 3D model to select tracts whose screen projections intersect the brush stroke. Given a current selection, a new set of selected tracts, generated with either the 2D brush or the 3D sphere, can be added to it, removed from it, or intersected with it. Finally, the following novel operation is implemented: once a set of tracts is selected, users can grow the selection by gradually adding tracts that are close to the current selection. Selections performed on a specific brain model can be saved for future analysis. Moreover, statistics such as average tube length, number of tubes, or average fractional anisotropy can be generated interactively on sets of selected tracts.

2.2.2 Similarity Between Fiber Tracts

The similarity between two tracts is quantified using the weighted chamfer distance discussed in [48]. This measure captures the degree to which two tracts follow a similar path, while giving more weight to the points closer to tract ends. Distances between each pair of fiber tracts are computed using λ = 0.5 as described in [48] and assembled to create a distance matrix. While this distance measure is a good approximation of the notion of similarity in the domain, the methods described in the following sections are independent of the particular distance measure used.

2.2.3 Explicit Visualizations of Tract Similarity

In the following three sections we describe three traditional ways of representing distance visually and how they can be adapted to the particularities of DTI datasets.

2D Point Embeddings

Planar embeddings of tract similarity are computed using Eades' [57] force directed method in concert with Chalmers' [34] acceleration technique, which reduces the complexity of each iteration to O(n) by using a sampling strategy.

Figure 2.3: 2D tract embedding for different spring force settings. a) Spring force with absolute distance displacement. b) Spring force with absolute distance displacement, weighted by decay function and with repulsive force. c) Spring force with relative distance displacement, weighted by decay function and with repulsive force. In c) clusters are tighter, making selection and manifold recognition easier.

For the force computation we use Hooke's law F = −kΔX, where ΔX is the spring displacement and k is the spring constant. We experimented with variations of this force to obtain embeddings that are better suited for neurotract interaction and analysis: a sharper definition of clusters can improve bundle selection and manifold recognition, while small distances should take embedding priority over large distances. We tried the following approaches: using squared distance to exaggerate large distances and make clusters more defined; using relative displacement instead of absolute difference in distance to give larger distances more arrangement flexibility; and weighting forces with a factor inversely proportional to distance while adding a repulsive force between points. As Figure 2.3 shows, good visual results were obtained by combining relative distance displacement, forces weighted by a decay factor e^(−σ/d), with σ a decay constant and d the distance, and a repulsive force F_rep = k_rep / d_embed^2 between all points, where k_rep is a constant and d_embed is the embedded distance.
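A minimal sketch of one relaxation step under this force model follows; the structure mirrors the text, but the constants and the O(n^2) all-pairs loop (rather than Chalmers' sampling acceleration) are illustrative simplifications:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Vec2 { double x = 0, y = 0; };

    // One O(n^2) relaxation step of the modified spring embedder. pos holds
    // current 2D positions; dist is the target-distance (tract similarity)
    // matrix. sigma, kRep and step are illustrative tuning constants, not
    // values from the original implementation.
    void relaxStep(std::vector<Vec2>& pos,
                   const std::vector<std::vector<double>>& dist,
                   double sigma = 1.0, double kRep = 0.05, double step = 0.1) {
        const std::size_t n = pos.size();
        std::vector<Vec2> force(n);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = i + 1; j < n; ++j) {
                double dx = pos[j].x - pos[i].x, dy = pos[j].y - pos[i].y;
                double dEmb = std::sqrt(dx * dx + dy * dy) + 1e-9;
                double ux = dx / dEmb, uy = dy / dEmb;
                double dij = std::max(dist[i][j], 1e-9);
                // Spring term: relative distance displacement, weighted by
                // the decay factor e^(-sigma/d) given in the text.
                double spring = std::exp(-sigma / dij) * (dEmb - dij) / dij;
                // Inverse-square repulsion between all point pairs.
                double repulse = kRep / (dEmb * dEmb);
                double f = spring - repulse;  // > 0 pulls i and j together
                force[i].x += f * ux;  force[i].y += f * uy;
                force[j].x -= f * ux;  force[j].y -= f * uy;
            }
        for (std::size_t i = 0; i < n; ++i) {
            pos[i].x += step * force[i].x;
            pos[i].y += step * force[i].y;
        }
    }

Repeating this step until displacements fall below a tolerance yields the steady-state embedding.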
2D point interactions include point selection and point coloring. Selection is performed by clicking and dragging; multiple selections can be performed to select points from non-adjacent regions. For coloring, the 2D coordinates of the embedding can be interpreted as the (a, b) coordinates in the L*a*b* color space and, for a given luminance, colors can be assigned to points. The result is that close points receive perceptually close colors. However, this color embedding is not ideal due to the particularities of the L*a*b* color gamut: it has an irregular shape and saturated colors close to the boundaries. The 2D coordinates need to be scaled to fit into the gamut and will thus occupy a small, central region of the gamut that corresponds to unsaturated colors.

A 3D Color Embedder

A better coloring can be obtained, as seen in Figure 2.1, by using a 3D color embedder. We compute an approximation of the L*a*b* color gamut, as visible in the lower-right panel of Figure 2.1, and use it as a container for force directed embedding. To avoid having to adjust a repulsive container force, which would likely need a hard-to-control, steep gradient, we perform a physically accurate simulation with container contact detection. The embedding begins in the center of the gamut and is gradually expanded until most of the space is filled. During implementation we observed that the largest distances are often embedded along the luminance axis (the y-axis of the color gamut). This is problematic because luminance offers little resolution and can be interpreted as a lighting effect. We therefore apply a “flattening” force at the beginning of a simulation cycle to force large distances to lie in the horizontal plane. These force components, acting on the y-axis towards the center of the gamut, wear off as the embedding moves towards a steady state. The force computation used is the same as for the 2D embedding, with straightforward 3D modifications. In terms of interaction, the color embedder only supports collapsing and color grabbing.

Dendrograms

Dendrograms are visual representations of hierarchical trees obtained through agglomerative clustering. We use average-linkage clustering, whereby the distance between two clusters is computed as the average of all inter-cluster distances. Minimum linkage does not give consistent results because of so-called “broken tracts” introduced by the fiber tracking algorithm: short tubes placed between major tract bundles will cause these bundles to be glued together by a minimum-linkage algorithm. To compute the tree layout we use the method described in [136]: for each subtree the layouts for the two child trees are computed recursively and placed next to each other, aligned at the bottom; the root is then placed one unit above their bounding box, in the middle of its horizontal axis. For single-node trees a unit bounding box is used. The following interactions are implemented for dendrograms: multiple node selections, collapsing and expanding of individual nodes, and collapsing nodes automatically through cluster cuts.
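For concreteness, a naive sketch of the average-linkage agglomeration underlying these dendrograms is shown below; real implementations cache inter-cluster distances instead of recomputing them and record merge heights to support cluster cuts, and the stopping criterion here (a target cluster count) stands in for a dendrogram cut:

    #include <cstddef>
    #include <limits>
    #include <vector>

    struct Cluster { std::vector<int> members; };

    // Naive average-linkage agglomeration over a precomputed distance matrix
    // d: repeatedly merge the two clusters with the smallest mean inter-
    // cluster distance until k clusters remain (assumes k >= 1 and a
    // non-empty, square matrix).
    std::vector<Cluster> averageLinkage(
            const std::vector<std::vector<double>>& d, std::size_t k) {
        std::vector<Cluster> cs(d.size());
        for (std::size_t i = 0; i < d.size(); ++i)
            cs[i].members = { static_cast<int>(i) };
        while (cs.size() > k) {
            std::size_t bi = 0, bj = 1;
            double best = std::numeric_limits<double>::max();
            for (std::size_t i = 0; i < cs.size(); ++i)
                for (std::size_t j = i + 1; j < cs.size(); ++j) {
                    double sum = 0;  // mean of all pairwise member distances
                    for (int a : cs[i].members)
                        for (int b : cs[j].members) sum += d[a][b];
                    double avg = sum /
                        (cs[i].members.size() * cs[j].members.size());
                    if (avg < best) { best = avg; bi = i; bj = j; }
                }
            cs[bi].members.insert(cs[bi].members.end(),
                                  cs[bj].members.begin(),
                                  cs[bj].members.end());
            cs.erase(cs.begin() + static_cast<std::ptrdiff_t>(bj));
        }
        return cs;
    }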
2.2.4 Hierarchically Projected Neural Paths

2D point embeddings and dendrograms suffer from a major drawback, as will be shown in the findings section: they lack an anatomical interpretation. This section describes hierarchically projected neural paths (see Figure 2.2), a type of representation that packages the strengths of abstract, low dimensional representations in an anatomical framework. The following two sections describe how hierarchically projected neural paths are constructed.

Clustering and Projection

Hierarchically projected neural paths are schematic views of major tract bundles projected on a few selected planes. In this work the sagittal, coronal and transverse planes were chosen as the main modes of representation (see Figure 2.2).

We first compute a clustering tree using an average-linkage hierarchical clustering algorithm on the tract distance matrix (e.g., [53]). We choose the average-linkage criterion because it is less sensitive than minimum linkage to broken tracts due to tracking errors. We obtain a clustering of tracts by manually setting a cut threshold on the dendrogram. This threshold can also be interactively changed by users to control the coarseness of the clustering. A constant cut at 60% of the clustering tree's height gave consistent results across the six datasets we experimented with.

Figure 2.4: Schematic tract-cluster representation. (Top) 2D projections of a tract-bundle, with an associated centroid curve (orange), are determined from a hierarchical clustering of initial 3D tracts. (Middle) The centroid curve is smoothed by a spline and the endpoints of non-centroid curves are clustered using their initial 3D coordinates (four clusters); for each cluster, three control points linking the center of the cluster to the centroid spline are computed. (Bottom) Splines are run from each curve endpoint through the control points of its corresponding cluster.

Next, we create simple orthogonal projections of tracts on each plane. We cull out tracts that do not contribute significantly to the projection: if the ratio of projected tract length to true tract length is under a threshold value, we remove the tract from the corresponding cluster. We set the culling threshold to 0.65 for the projections used in our experiments. Finally, we compute a centroid for each cluster by choosing the tract with the smallest maximum distance to any other tract in the cluster. We found that for illustration purposes it is desirable to avoid broken tracts. We therefore weight the centroid selection to favor longer tracts by dividing the maximum distance from each tract to any other tract by the tract's length.
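The culling and centroid-selection heuristics just described can be sketched as follows; the helper names are ours, and the axial projection (dropping z) stands in for whichever projection plane is being computed:

    #include <algorithm>
    #include <cmath>
    #include <limits>
    #include <vector>

    struct Vec3 { double x, y, z; };
    using Tract = std::vector<Vec3>;

    double length3D(const Tract& t) {
        double len = 0;
        for (std::size_t i = 1; i < t.size(); ++i) {
            double dx = t[i].x - t[i-1].x, dy = t[i].y - t[i-1].y,
                   dz = t[i].z - t[i-1].z;
            len += std::sqrt(dx * dx + dy * dy + dz * dz);
        }
        return len;
    }

    // Length of the tract's orthogonal projection; the axial plane
    // (dropping z) is used as an example.
    double lengthProjected(const Tract& t) {
        double len = 0;
        for (std::size_t i = 1; i < t.size(); ++i)
            len += std::hypot(t[i].x - t[i-1].x, t[i].y - t[i-1].y);
        return len;
    }

    // Cull tracts whose projection is too foreshortened to contribute;
    // 0.65 is the threshold reported in the text. Assumes non-degenerate
    // tracts (positive 3D length).
    void cullForProjection(std::vector<Tract>& bundle, double threshold = 0.65) {
        bundle.erase(std::remove_if(bundle.begin(), bundle.end(),
                         [threshold](const Tract& t) {
                             return lengthProjected(t) / length3D(t) < threshold;
                         }),
                     bundle.end());
    }

    // Centroid selection: the tract minimizing its maximum distance to any
    // other tract in the bundle, divided by its length to favor long,
    // unbroken tracts. dist holds precomputed pairwise tract distances.
    std::size_t selectCentroid(const std::vector<Tract>& bundle,
                               const std::vector<std::vector<double>>& dist) {
        std::size_t best = 0;
        double bestScore = std::numeric_limits<double>::max();
        for (std::size_t i = 0; i < bundle.size(); ++i) {
            double worst = 0;
            for (std::size_t j = 0; j < bundle.size(); ++j)
                if (j != i) worst = std::max(worst, dist[i][j]);
            double score = worst / length3D(bundle[i]);
            if (score < bestScore) { bestScore = score; best = i; }
        }
        return best;
    }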
Figure 2.5: Depth ordering of 2D paths. For each segment of a 2D spline, we locate a corresponding segment on the 3D curve from which the spline was derived by traveling the same fractional distance along both curves. The depth of the 2D segment is the same as the depth of the middle of its corresponding 3D segment.

Visual Representation

We opted for an illustrative rendering of brain projections. Illustrative visualization uses abstraction to reveal structure in dense visualizations and to harness scientists' familiarity with textbook representations [170]. Both criteria apply to white matter tractograms: fiber bundles provide a natural abstraction of 3D anatomy that avoids the clutter of large streamtube collections, while textbook illustrations [72] shape the intuition of neuroscientists. These advantages have also been recognized and explored by Otten et al. [127].

The rendering assumes a given clustering with assigned centroid tracts, which can be computed as described in the previous section. Our approach is inspired by Holten's hierarchical edge bundles [81] in attempting to group all fiber-tracts from a bundle into one visually salient structure. However, hierarchical edge bundles operate on abstract connections that are unconstrained by concrete geometrical shapes; they can therefore be drawn according to visual aesthetics principles alone. Conversely, fiber tract paths play important anatomical roles and should be preserved in tractogram visualizations. To this end, we perform our bundling by routing tracts along the path of the most representative tract in their bundle. Thus, the centroid tracts define a schematic neural skeleton on top of which the non-centroid tracts are scaffolded.

Projections of centroid curves are smoothed prior to rendering to achieve a schematic representation and to reduce clutter. This is done by sampling a number of evenly distributed control points (five in our implementation) along the tract projection and using them as control points for a spline. In our implementation the spline is piecewise cubic and consists of 30 segments. The thickness of a centroid curve is proportional to the square root of the number of tracts in the bundle.

Once centroid tracts are represented as 2D splines, endpoints of non-centroid curves are linked to their cluster's centroid spline following the procedure illustrated in Figure 2.4. First, the endpoints of non-centroid curves in a bundle are clustered based on the endpoints' initial 3D coordinates; two endpoints are placed in the same cluster if the distance between them is less than 2 mm. Then, for each such endpoint cluster we compute three control points that link the geometrical center of the endpoint cluster to the centroid spline: the first point is the center itself, the second is the point on the centroid spline closest to the center point, and the third is determined by traveling from the second point down the centroid spline, towards each curve's other endpoint, for a predefined distance (e.g., half of the distance between the first two points). Ultimately, splines are run from each tract endpoint through its cluster's three control points, thus linking each endpoint to the centroid path. The thickness of these endpoint linkage splines gradually increases from unit thickness (i.e., single-tract thickness) at the tract endpoint to a thickness proportional to the square root of the endpoint cluster size, where it merges with the centroid spline.

We depth-order spline segments so that 2D centroid spline crossings can indicate the depth ordering of their corresponding 3D shapes. The depth ordering is done differently for centroid splines and non-centroid splines: while centroid curves are close representations of actual 3D tracts, non-centroid curves are abstract representations obtained through the process described above. Furthermore, the depth ordering is approximate and may produce artifacts.

For centroid splines, the depth of a spline segment is computed by finding a matching segment on the 3D tract from which the spline was derived and taking the depth of that segment's center (Figure 2.5). The matching segment on the 3D tract has its endpoints at the same fractional distance from the start of the 3D tract as the 2D segment's endpoints from the start of the 2D spline. This per-segment ordering was chosen because of the intricacy of white matter tractograms: tracts often wrap around each other such that a correct per-tract depth ordering cannot be determined. Treating each curve segment independently maximizes the probability that the 2D rendering remains truthful, at least within the resolution of the tract segmentation. Conversely, non-centroid splines are completely abstract 2D representations; the depth of a non-centroid spline is determined by averaging the depth of the corresponding 3D tract. Wrapping fiber tracts are therefore not captured by this latter process.
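A sketch of this fractional arc-length matching, with hypothetical helper names:

    #include <cmath>
    #include <vector>

    struct Vec3 { double x, y, z; };

    // Interpolated point at fraction f (0..1) of a polyline's arc length,
    // found by walking the cumulative segment lengths. Assumes at least
    // two sample points.
    Vec3 pointAtFraction(const std::vector<Vec3>& tract, double f) {
        std::vector<double> cum(tract.size(), 0.0);
        for (std::size_t i = 1; i < tract.size(); ++i) {
            double dx = tract[i].x - tract[i-1].x;
            double dy = tract[i].y - tract[i-1].y;
            double dz = tract[i].z - tract[i-1].z;
            cum[i] = cum[i-1] + std::sqrt(dx * dx + dy * dy + dz * dz);
        }
        double target = f * cum.back();
        std::size_t i = 1;
        while (i < tract.size() - 1 && cum[i] < target) ++i;
        double t = (target - cum[i-1]) / (cum[i] - cum[i-1] + 1e-12);
        return { tract[i-1].x + t * (tract[i].x - tract[i-1].x),
                 tract[i-1].y + t * (tract[i].y - tract[i-1].y),
                 tract[i-1].z + t * (tract[i].z - tract[i-1].z) };
    }

    // Depth of the 2D spline segment spanning arc-length fractions
    // [f0, f1]: the view-space depth (z here) of the midpoint of the
    // matching 3D segment on the originating tract.
    double segmentDepth(const std::vector<Vec3>& tract, double f0, double f1) {
        return pointAtFraction(tract, 0.5 * (f0 + f1)).z;
    }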
Finally, bundle color, texture or thickness can be used as additional depth cues. While we have not fully integrated and evaluated such encodings in our current prototypes, we have experimented with color cues and found them to be useful.

In the following section, we give details on how we use 2D neural path representations as part of an interactive application and as standalone digital maps.

2.2.5 A Multiple-Views System for Exploring DTI Datasets

Using the Qt GUI and G3D graphics libraries we created a framework that allows any of the previously described visualizations to be linked together. Figure 2.2 shows the application with a traditional 3D stream-tube view linked to three path projections. Operations performed in one view are mirrored in all other linked views. For example, selecting points in the 2D embedding will result in a selection in the brain model, while color grabbing in the 3D color embedder will cause tracts to receive the corresponding coloring information (see Figure 2.1). In terms of the system's implementation, following an interaction or any change in its state, a visualization can broadcast a message that informs linked views that one of its properties has changed. Linked visualizations, depending on their implementation, can either act on such a message or ignore it.
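A minimal sketch of this linked-view messaging, with hypothetical class names (the actual framework is built with Qt and G3D, whose signaling mechanisms differ):

    #include <string>
    #include <vector>

    // A broadcast message naming a changed property; here a selection of
    // tract indices serves as the example payload.
    struct Message { std::string property; std::vector<int> selectedTracts; };

    class View {
    public:
        virtual ~View() = default;
        // Called for every broadcast; implementations may simply ignore
        // messages they do not understand.
        virtual void onMessage(const Message& m, const View* sender) = 0;
    };

    class Coordinator {
        std::vector<View*> views;
    public:
        void link(View* v) { views.push_back(v); }
        void broadcast(const Message& m, const View* sender) {
            for (View* v : views)
                if (v != sender) v->onMessage(m, sender);
        }
    };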
We have recently also developed a digital map interface that couples the projected path representations with the Google Maps API to enable web-accessible, light-weight visualizations of DTI data. This mode of distribution is described in Chapter 4.

2.3 Evaluation and Findings

We evaluated the methods both anecdotally, by interviewing domain experts, and quantitatively, by measuring subjects' bundle selection times as part of a formal user study. The results show that while planar abstractions in general are likely to accelerate the exploration of DTI tractograms, the hierarchical path projections introduced in Section 2.2.4 offer the most significant improvement due to their anatomically grounded representation. Below we detail the evaluation procedures and results for both the anecdotal evaluation and the quantitative user study.

2.3.1 Anecdotal Evaluation

In a first evaluation we gathered feedback about the value of linking the explicit planar abstractions presented in Section 2.2.3 to traditional 3D models. We showed our prototype to a group of experts, including one research neuropsychiatrist and three neuropsychologists. They were all interested in the relationship between fiber tracts and cognitive and behavioral function in the brain and had either seen or interacted with streamtube representations of fiber tracts before. A think-aloud protocol was used; we demonstrated the prototype using a projector while asking questions and collecting feedback.

The experts agreed that the proposed paradigm can supplement the existing tools and would be particularly useful in accelerating the selection of tract bundles. They found the coloring method to be helpful and visually appealing, which was argued to be an important factor for the adoption of a visualization tool. They found the hierarchical clustering tree to be more useful than the 2D scatterplot representation. One interaction scenario proposed was to select a rough region in the brain model using sphere selection and then gradually refine it in the hierarchical clustering tree. This feedback was backed by concurrent research on linking planar representations to DTI datasets presented in [37]: the authors ran a quantitative study and found that a system which linked a scatterplot representation to a 3D tractogram led to lower selection times for major tract bundles, compared to several other leading DTI visualization systems. On the downside, there was concern that learning the correspondence between the 2D point-cloud representation and the actual fiber-tract collection can be non-trivial.

Following this feedback, the hierarchical path projections described in Section 2.2.4 were developed. The goal was to package the benefits of dimensionality reduction techniques in an anatomically valid representation. This new representation mode was evaluated in a second anecdotal study. Three neuroscientists took part in an informal evaluation: we demonstrated the prototype while asking questions and collecting participants' feedback. Two of the experts also tried both interfaces themselves by selecting a set of major bundles: the CC, the cingulate bundle, the uncinate, the anterior internal capsule, and the corticospinal tract. There was agreement that our new interface was significantly more intuitive and easier to use and learn than the abstract low dimensional representation.

2.3.2 Quantitative User Study

As noted in the previous section, the authors of [37] compare a system that links a 3D stream-tube model to a standard 2D point representation to several state-of-the-art DTI visualization systems that do not employ linked planar views. It was therefore judged to be a valid baseline against which to compare the improved projected path representation. Their interaction consists of a brush tool that works similarly to ours in 3D and as a lasso tool on the 2D point representation. Users are able to select tracts or points and then remove them or, conversely, remove everything else from a current selection.

The user study involved four subjects with general neuroanatomy knowledge, all of whom had some experience with tractography visualization tools. The first subject was a neuroscience graduate student who had previously used Diffusion ToolKit (DTK) for six months. The second subject had five years of experience in diffusion MRI clinical research and had used BrainApp, Slicer and TrackVis. The third subject was a biomedical engineering graduate student with significant tract selection experience using BrainApp. Our last subject was a computer science graduate who was developing automatic algorithms for white matter analysis. Two of the users were male and two female. All users were right-handed. The user study involved the timed selection of three major bundles in two distinct datasets, using the two systems.
The three targeted bundles were the bilateral cingulate bundle, the bilateral corticospinal tract, and the right superior longitudinal fasciculus. The order in which the systems were used was alternated: two of the subjects started with the projected path representation while the other two were asked to first use the 2D point embedding system. For each system, users were first given a brief description of the underlying visualization concepts and were shown a brief demo. They were then trained on the same three bundles as they would use in the real task, but on two different datasets. Following training, they were asked to select the three bundles while being timed on each selection. After each selection they also provided a five-point subjective confidence estimate for their selection. This methodology aimed to capture users' performance once they had already developed strategies for bundle selection, and as such more closely models what would happen over extended use. After completing the task on both systems, subjects were asked to complete a post-questionnaire in which they provided qualitative feedback on their experience.

                time (secs)              confidence
              cb    cst   slf   mean   cb    cst   slf   mean
    2D point  227   361   234   274    4.1   3.3   3.1   3.5
    2D path   136   165   215   172    4.1   3.8   3.7   3.9

Table 2.1: User performances on the bundle selection task (cb: cingulate bundle; cst: corticospinal tract; slf: superior longitudinal fasciculus).

Results from the quantitative study are conclusive. The average times and confidence measures for each user, over all datasets and tract bundles, are shown in Figure 2.6 and were analyzed using a paired t-test (projected path measures subtracted from 2D point measures). As seen, there is a significant drop in selection time using the novel projected path representation. Results are less conclusive for the subjective confidence measure, with two of the differences lying within standard error. Table 2.1 summarizes users' overall and per-bundle mean performances on each tool.

In addition to the quantitative measurements, by observing our users' actions we distinguished several typical behaviors. Two distinct selection strategies were used in the projected path visualization. Two of the users consistently brushed over large areas of the projection to ensure that the targeted bundle was selected and then relied on the 3D view to clean up the selection. The other two users aimed for fine selections in the 2D projections and then inspected the 3D view to determine whether any fibers escaped the selection. They added the missing tracts using short, targeted brush strokes and then removed tubes that were erroneously added during this operation. These users seemed to have a better understanding of the mapping between the 3D view and the 2D projections, which perhaps explains the difference in strategies. All subjects used the 2D point representation relatively rarely. The most common operation was to remove points they were completely confident were not part of the selection (e.g., half of the brain, peripheral U-shaped bundles). However, as one of the users noted, in the absence of a clear mapping between the views subjects were hesitant to perform bold operations in 2D. This confirms concerns expressed in the first anecdotal evaluation. In a few cases users took significantly longer on a single task than other users for the same dataset and tract. This outlier effect happened when users switched to a “rigorous” refinement mode and lost track of time. Interestingly enough, a “rigorous” refinement mode did not usually result in a better subjective confidence rating.
Figure 2.6: Per-user differences between (a) time and (b) confidence measurements on the two tools. Differences are obtained by subtracting 2D path tool performance values from 2D point tool performance values. Red squares show the mean performance difference between the tools. Error bars around the red squares indicate the standard error of per-user differences.

2.4 Discussion

Each dimension in a visualization comes with extra cognitive and perceptual load. While there are clear advantages to three-dimensional visualizations in some contexts, previous work shows that humans are better at understanding two-dimensional representations [40, 143]. Beyond reducing cognitive and perceptual load, dimensionality reduction techniques have been popular in data mining because the “intrinsic dimensionality” of data is often much lower than the dimension of the space where the data is immersed. In this context, it is not difficult to imagine fiber tracts as points on a low-dimensional manifold sitting in a high-dimensional space, particularly when we consider fiber tracts' locally continuous and smooth variation in the brain. So, low-dimensional representations can go beyond being interaction gadgets and provide new “windows” into the intrinsic structure of data.

For instance, while clearly being inferior to hierarchically projected paths for some tasks, the 2D point embeddings can be used for a comparative analysis of multiple distance measures. In Figure 2.7, three different embeddings computed for the following distance measures are shown: the one proposed in [48], the one used in this dissertation, and the Hausdorff distance. The embeddings were linked to the model they were derived from. By selecting points in either embedding, changes in relationships are highlighted in the others, while the corresponding tracts are displayed in a fourth window.

Figure 2.7: Comparing 2D embeddings for multiple tract distance measures. On the left, three types of distance measures were embedded: no end-point weight (top), weighted end-points (middle), Hausdorff (bottom). A few tract-points were selected. On the right, the corresponding 3D model is shown (top), together with the selected tracts in isolation from unselected ones (bottom).

Figure 2.7 illustrates how a fiber bundle is embedded depending on the particular distance measure: the first type of measure uses only tract curvature and will thus place the three tracts that deviate from the bundle path further apart from the rest; the second measure adds weight to tract endpoints and as such ignores the bend in the three tracts and places all of them into the same cluster; finally, the Hausdorff distance considers only the minimum point-to-point distance and will thus place the tract-points together, but also in the vicinity of other tracts that, while having close individual points, do not necessarily display any curvature similarity.

As shown, the explicit distance visualizations have the drawback of lacking anatomical interpretation. This might be alleviated by incorporating abstract representations of anatomical landmarks into the representations. For instance, points that correspond to tracts which come close to the brain ventricle can be highlighted in the planar representations.
Alternatively, the ventricle could be approximated by a set of fictional tracts that are projected along with the rest of the tracts but represented differently. While these methods could alleviate some of the mapping problems, the projected path representations are likely to maintain a significant advantage.

Finally, the 2D neural path representation uses simple heuristics but relies heavily on the quality of fiber tracking, distance measure, and clustering. All of these factors can reduce the aesthetic and functional success of the representation. While this can be seen as a limitation, it also means that progress in any of these techniques will result in improved projected path representations.

2.5 Concluding Remarks

A new method for visualizing and navigating through tractography data, combining two-dimensional representations of fiber tracts with streamtube models, was presented. Based on the geometrical similarity between tubes, planar abstractions were created from the tractographic data: a 2D point embedding, a dendrogram, and a novel visualization based on projecting major bundles onto three projection planes. These planar representations are linked to traditional 3D stream-tube visualizations of white matter tractograms through interaction and color. Two anecdotal evaluations and one quantitative evaluation show that such planar abstractions can improve data understanding and interaction in general, but are most effective if they preserve anatomical features, as in the case of the projected bundles technique.

Chapter 3

Exploring and Analyzing Protein Networks

Proteins within a cell interact with one another to regulate the cell's activity. The nature of these interactions is diverse. Among others, an external event can be transmitted to the inside of a cell through interactions of signaling molecules; a protein can bind to another protein to alter its function; or a protein can carry another protein to a specific cell location. A cascade of protein interactions peculiar to a specific cell, stimulation, or cellular outcome is called a signaling pathway. An in-depth understanding of these pathways will, among other outcomes, let researchers discover efficient drugs that can influence a cell's behavior without causing unwanted side-effects.

Experimental data is an important component that researchers use to understand how signaling pathways function. For instance, researchers can artificially stimulate a cell and measure how the proteins within it respond, possibly over a series of time-points. To efficiently interpret the results of such experiments, they need to be collated with existing knowledge that can explain some of the observations and provide valuable insights for hypothesis generation. Among the most common such data used in signaling pathway analysis are protein-protein interactions extracted from proteomic publications and stored in online databases.

Advances in proteomic experimental techniques and improved analytical methods have enabled researchers to produce vast quantities of experimental data. Combined with the sheer complexity of protein interaction networks, this increases the information space even more. Thus, thinking about the data at its original low level has become impractical. New computational techniques are required that either extract relevant information automatically or let researchers process data faster by looking at condensed visual representations.
This necessity has been acknowledged by the research community, and analysis frameworks that build on traditional graph drawing to visualize protein interaction networks have emerged. However, findings presented in this chapter, as well as results from more recent work, suggest that additional research is needed to ensure that the visualization methods employed are adequate for proteomic research.

This chapter presents a design study on several novel visual and interaction paradigms for the analysis of quantitative proteomic data, canonical signaling pathway models, and protein interaction networks, along with the proteomic analysis requirements that motivated them. The methods are evaluated anecdotally with domain experts to determine their overall ability to accelerate the proteomic discovery process.

The methods described are general and discussed in terms of their benefits as components of established protein network analysis applications such as Cytoscape. However, for concrete exemplification, they will occasionally be framed in the context of the testbed application used to develop and evaluate them. Figure 3.1 illustrates the main visualization and interaction paradigms presented in this chapter: harnessing the researcher's existing mental schema and intuition by integrating dynamic interaction data into static but familiar signaling pathway images provided by the user; enabling proteomic-specific interaction-level analysis of dense networks by integrating a novel Focus+Context technique; and driving exploration by comparative analysis of multiple experimental datasets.

First, we relate results to previous work in protein interaction network visualization and related techniques. We then introduce the methods by presenting an overview of the visualization workflow and then detailing each of its components. We then present our results as findings and evaluations of how the techniques improve the proteomic workflow. A discussion of design choices follows. A distillation of the findings concludes the chapter.

3.1 Related Work

Here we present a few related work topics pertaining to this research: visualizing protein interaction networks in particular and networks in general, and Focus+Context visual exploration methods.

3.1.1 Visualizing Signaling Pathways and Protein Interaction Networks

The first representations of protein interaction networks had the form of static, schematic drawings of signaling pathways. Several papers, such as [25, 114], discuss guidelines and approaches to drawing such representations. However, the static nature and manual assembly became serious limiting factors when protein-protein interaction databases were first created – researchers needed a way to generate visualizations on the fly based on database queries.

Many popular protein interaction databases – examples include the Human Protein Reference Database (HPRD) [131], the Molecular INTeractions Database [181], STRING [171], and the Database of Interacting Proteins [176] – started to provide on their websites visual components that let users navigate the protein interaction space. Most of these visualizations represent protein-protein interactions via a node-link paradigm and produce visual layouts with spring models or other force-directed methods. Recently, more advanced standalone visualization systems have emerged; notable
among them, Cytoscape [148] and VisANT [82] offer multiple representation methods, session-saving capabilities, and numerous features for pathway analysis. Moreover, users can add features and customize the software using plugin architectures.

Figure 3.1: Analysis of a protein interaction network in the context of the T-cell pathway. Proteins and interactions dynamically extracted from the HPRD database (small fonts scattered between the protein icons in the pathway view) are integrated directly into an imported image of a canonical signaling pathway. Heatmaps representing quantitative data from multiple experiments appear on the right and are used to drive analysis. Focus+Context is implemented as a semitransparent plane hovering over the global image and allows researchers to navigate through complex networks in a one-level-at-a-time mode.

Nevertheless, aspects of these visualization systems can still be improved. For instance, using generic techniques devised by the graph-drawing community sometimes yields visualizations that are far from intuitive to proteomic researchers, since their failure to incorporate protein cellular location and signaling pathway drawing conventions detracts from the visualization's familiarity. This problem is also recognized by [17] and [98]. Another insufficiently investigated topic is the integration of quantitative data from large-scale proteomics experiments into protein interaction visualizations. Cytoscape uses a flexible plug-in architecture to address this and other functionality needs; other systems simply let one load textual annotations onto a protein network. The visual display, analysis, and comparison of results from multiple quantitative proteomic experiments are still an area of active research. The most recent work identifying and addressing the issues of both layout and experimental data is [17]. It extends Cytoscape with a new protein network layout algorithm that organizes proteins in cellular layers based on an annotation file supplied by the user. Quantitative data can be loaded and viewed as color mappings on the proteins. Multiple experimental conditions are shown using small multiples (i.e., multiple iconic representations of the protein network for each experimental condition) and a parallel coordinate view. The work presented in this chapter differs by offering an alternate way of drawing the protein network, a different representation of the experimental data, and the ability to load multiple experiments, each with several conditions, and by identifying and supporting the need for exploring biological networks at global and local levels simultaneously.

3.1.2 Visualizing and Exploring Networks

There are many techniques and systems for displaying general graphs, such as [46, 67, 65, 166]. However, they often fail to translate well to biological networks. Protein network layouts require a constraint-based approach in which general aesthetic graph-drawing criteria are met while other biological or user-defined constraints are satisfied. Dwyer and Marriott [56] is the state of the art in constraint-based graph layout, but its complexity, while powerful in its adaptability, makes it hard to implement and control. Like [17], we chose to implement an algorithm that is easier to adapt to our specific problem. The layout algorithm itself is close in several aspects to the one described in [66] for drawing evolving graphs: they place new nodes at the barycenter of existing ones, with subsequent force-directed steps. We use a similar approach to place database-extracted proteins in relation to pathway proteins.
The idea of scaffolding graph drawings on another structure, as done in this work, is found in [120]: there, domain knowledge is used to identify spanning trees within graphs, and the simpler tree layouts are used as scaffolds for the general graph structure. Similarly, [6] automatically computes spanning trees as graph scaffolds and demonstrates the method in the context of biological networks.

3.1.3 Focus and Context

Revealing global aspects of data while also granting access to details is commonly known as Overview+Detail. A subcategory of Overview+Detail is formed by so-called Focus+Context techniques, which show the global and detailed views simultaneously. They are often preferred over more traditional Overview+Detail techniques, such as zooming and panning, which can leave the global picture out of view when zoomed in on details. Quantitative evidence that may explain this preference was published by Plumlee and Ware [134], who show that the cognitive cost is higher when zooming and panning than when viewing local and global aspects of the data simultaneously on side-by-side displays.

Several Focus+Context techniques have been devised. For instance, [137] leverages trained human 3D perception by displaying trees in 3D and using the proximity of objects as a direct focusing mechanism. Another popular Focus+Context approach is to distort the representation space to give more screen real estate to focused regions as opposed to context regions; examples of such techniques are [142], [120] and [162]. The Focus+Context method presented in this chapter is closely related to [160], which interposes a separate viewing plane between the viewer and the actual scene. Although similar to a regular lens, this space can be used to display detailed information about the underlying scene.

3.2 Design Elements

This section introduces the design principles and implementation details employed by the visualization methods presented in this chapter. First, an overview of the proposed visualization workflow is given. Then, details about each of its components are provided. Researchers analyze their data in the following workflow:

1. import a model of a canonical pathway representation, either by loading a signaling pathway image and preprocessing it to help the system infer the structure (Figure 3.2, lower left) or by specifying the model explicitly by placing proteins and interactions on an empty canvas (Figure 3.2, lower right);

2. load one or more quantitative experimental datasets;

3. automatically extract proteins and interactions from protein interaction databases such as HPRD and build a network around the pathway model specified in step 1 and the quantitative data from step 2;

4. represent the network graphically using a novel canonical pathway-oriented layout (Figure 3.3);

5. explore and analyze the network guided by interesting features noted in the experimental data; investigate the network at interaction level using a Focus+Context technique; analyze how known information blends with the new experimental results using such features as clustering of quantitative proteomic data, filtering, highlighting, and information on demand (Figure 3.1);

6. derive insights or generate new hypotheses, design and run new experiments, and restart from step 2.

3.2.1 Pathway Model Specification

The solution described in this chapter requires the user to specify, using a simple interface, the canonical pathway representation of the signaling pathway under investigation.
This can be done either by placing proteins and interactions on an empty canvas or by using a pathway image that is preprocessed to help the system extract the pathway structure; the preprocessing entails drawing single, continuous strokes over or around each pathway element – proteins, interactions, and other entities. These strokes aid the software in identifying image features (Figure 3.3) as detailed below.

If the stroke endpoints are far apart compared to the stroke length, the image feature is probably an interaction, and the endpoints are matched against protein positions to find which proteins are involved. The interaction strokes snap to image features in a manner similar to a lasso tool; this is done in order to obtain the correct image region that the interaction is covering, for reasons described in Section 3.7.

Figure 3.2: Structuring protein interactions around familiar canonical pathways provides intuitive visualizations. A canonical signaling pathway representation (top) can be imported into the system in two ways: on the lower left, the pathway image itself is loaded into the system and preprocessed by circling proteins and drawing over interactions; the pathway features are then inferred from the user strokes and image features, shown here in black. On the lower right, protein and interaction icons are placed and dragged on an empty canvas to create a new pathway model. After positional assignment of each protein, the software aids in associating interaction database accession numbers to each of the newly defined canonical pathway proteins.

If the stroke endpoints are close relative to the total length of the stroke, the feature-detection algorithm classifies the feature as a protein. It then computes an average color for the area enclosed by the stroke and removes all points dissimilar to it. In most cases this leaves only the selected image shape, from which the protein position on the canvas can be inferred. If a selection is unsatisfactory the user can cancel it and try again – depending on the previous selection, the algorithm will attempt to correct the image-processing parameters for the second try. For instance, if the area selected by the user is much larger than that returned by the algorithm, the color-similarity threshold is increased.

Once the graphical model is complete, either by pathway processing or by pathway drawing, the placed proteins need to be linked to protein identifiers in the protein interaction database. The user chooses the correct protein by searching the database for keywords using a dedicated dialog box. In our test cases this process took between 15 and 30 minutes for medium pathways such as those in the figures, but these times vary with image complexity and user training.
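For illustration, the endpoint heuristic that separates protein strokes from interaction strokes might look as follows; the ratio threshold is an assumed value, not the prototype's actual constant:

    #include <cmath>
    #include <vector>

    struct Pt { double x, y; };

    enum class StrokeKind { Protein, Interaction };

    double strokeLength(const std::vector<Pt>& s) {
        double len = 0;
        for (std::size_t i = 1; i < s.size(); ++i)
            len += std::hypot(s[i].x - s[i-1].x, s[i].y - s[i-1].y);
        return len;
    }

    // Closed-ish strokes (endpoints near each other relative to the total
    // stroke length) circle a protein; open strokes trace an interaction.
    // Assumes at least two sample points per stroke.
    StrokeKind classifyStroke(const std::vector<Pt>& s, double ratio = 0.2) {
        double gap = std::hypot(s.back().x - s.front().x,
                                s.back().y - s.front().y);
        return (gap < ratio * strokeLength(s)) ? StrokeKind::Protein
                                               : StrokeKind::Interaction;
    }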
3.2.2 Interaction Data

The experimental prototype described here uses the HPRD protein interaction database. HPRD is a protein interaction and metadata source based on manual literature search. The database information is stored and loaded as flat files. The network exploration paradigms defined here could be used with any protein interaction database. One of the main challenges in supporting a protein interaction database is providing access to useful metadata from other databases; this is due to the inherent difficulty of translating protein identifiers across independent protein databases.

3.2.3 Experimental Data

The quantitative proteomic data is loaded as XML or flat files upon pathway creation and can contain multiple quantitative data points as well as protein identifiers and other metadata. For graphical representation, the quantitative proteomic data are transformed into a colored heatmap representation (Figure 3.1) indicating fold changes of a given peptide across different experimental conditions (a time course of receptor activation or a comparison between wild-type and mutant cells). The following color coding is used: blue – decrease of proteomic quantity; yellow – increase of proteomic quantity; black – no change. If multiple experimental files are loaded, as in a comparison between wild-type and mutant cells, special types of heatmaps are computed for each pair of experiments to reflect changes between experiments: yellow then indicates a major change between the two experiments, while black corresponds to no change. A single protein can have multiple heatmaps, one for each assigned peptide. The heatmap icon appears in two places: in the expanded network exploration upper plane, attached to proteins revealed in the experiment (Figure 3.1), and in a dedicated panel on the right (Figure 3.1) containing all peptides discovered in an experiment.

For multiple quantitative datasets, the heatmap experimental data panel on the right (Figure 3.1) is configured to contain tabs not only for each separate experimental dataset but also for changes observed between pairs of datasets. For instance, in a phosphoproteomic receptor activation timecourse experiment involving wild-type cells and cells lacking critical signaling proteins, the heatmap panel contains one tab dedicated to timecourse phosphopeptide heatmaps in the wild-type cell, another tab for the mutated cell, and a third tab displaying the fold change of individual phosphopeptides observed between the two cell types through the receptor activation timecourse. This feature can be particularly useful in knockout-type experiments, since the differences in behavior between a normal and a mutated cell become evident immediately.

Figure 3.3: Proteins and interactions from HPRD (small fonts) that are connected to the canonical pathway model are: (left) integrated directly into the signaling pathway image, with one protein selected and its interactions highlighted; (right) structured around a user-constructed model. Different classes of proteins have different appearances: experimental proteins are colored yellow, kinases are drawn as hexagons, and receptors as irregular stars; several experimental proteins are not known to be connected to the pathway and are therefore located in the lower right corner. HPRD proteins are placed in a structured manner between the pathway proteins based on their separation from the pathway proteins. (Cutout) Disadvantage of simply drawing the network on top of the pathway image: HPRD interactions obscure elements of the canonical pathway; compare to the improved method (left), in which important pathway elements remain in the foreground.

The experimental data panel is kept visible at all times so that researchers can use it to explore the new quantitative data systematically. The items in the experimental data panel can be used to start the exploration by linking directly to the Focus+Context representation. Using experimental data to guide exploration was also discussed in [17]. The work here differs both in the way the information is presented to the user and in the emphasis that is put on comparative analysis of multiple experiments. Such analysis can also be performed in the other system, but the small-multiples approach is likely to overload the display if used with dense networks and large quantities of experimental data. Their parallel coordinates view was also not extended for both multiple time-points and multiple experiments.
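As an illustration of the heatmap color coding described earlier in this section, the following sketch maps a fold change to a color; the log-scale input and the saturation bound are assumptions, not values from the prototype:

    #include <algorithm>
    #include <cmath>

    struct Color { float r, g, b; };

    // Blue for decreased proteomic quantity, yellow for increased, black
    // for no change; intensity grows with the magnitude of the change and
    // saturates at maxAbs (an assumed bound, in log2 units).
    Color foldChangeColor(double log2FoldChange, double maxAbs = 3.0) {
        float t = static_cast<float>(
            std::min(std::fabs(log2FoldChange) / maxAbs, 1.0));
        if (log2FoldChange >= 0)
            return { t, t, 0.0f };   // black -> yellow (increase)
        return { 0.0f, 0.0f, t };    // black -> blue (decrease)
    }

For the pairwise experiment comparisons described above, the same mapping can be applied to the difference between two fold changes, matching the yellow/black change-versus-no-change scheme.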
3.2.4 Network Generation

From the user-provided pathway skeleton, the software constructs a protein-protein interaction network by loading proteins and interactions from the HPRD database. The network is grown iteratively in a breadth-first manner: first, proteins interacting directly with the canonical signaling pathway model are imported; then, in subsequent steps, proteins interacting with those added in the previous iteration are extracted from HPRD and included. Finally, interactions among all proteins are loaded. The number of levels to grow the network and optional filters used to exclude proteins from the build process are specified by the user.

However, growing the pathway from the user-specified proteins alone may leave experimental proteins outside the network. To ensure inclusion of all experimental proteins in the final visualization, the network is also grown from the experimental proteins themselves. This solution increases the chances of linking the experimental proteins to the pathway, since two networks are grown simultaneously toward each other.

3.2.5 Computing Protein Positions

While the canonical pathway proteins have user-provided, predefined positions, the system must compute where to put the proteins extracted from the interaction database. These proteins are placed depending on their distance, in terms of number of interactions, from each of the pathway proteins. If protein P interacts directly with protein A and is three interactions away from protein B, it is placed on the line segment between A and B, closer to A. The distances are not necessarily directly proportional to the path lengths: they can be weighted so that direct connections are much shorter than longer interaction paths. Essentially, the nodes are placed at a path-length-weighted barycenter of the pathway nodes. Barycenter positioning was also used in [66] to place new nodes in relation to already existing ones in the context of evolving graph drawings. This algorithm produces positions close to those computed by a traditional spring layout algorithm, since a node is dragged by the edge springs to a similar location.

This methodology leads to identical positions for some proteins, however, and a force-directed approach based on [67] is used to perturb the layout and remove overlaps; a simple linear grid improves the performance of the layout algorithm by using vicinities to reduce the number of comparisons needed to compute forces on protein nodes. The sizes of nodes are taken into consideration when computing repulsive forces. The aspect ratio of nodes in relation to the force vectors can also be taken into account so that forces are applied anisotropically. This leads to slightly longer run times but minimizes overlap, especially in augmented pathway images where some nodes can be much larger in one direction. As a special case, positions cannot be computed for proteins linked only to the experimental data and not to the known pathway. These are placed in the lower right side of the display, yielding a cluster of proteins that are not known to be connected to the pathway.
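A minimal sketch of this path-length-weighted barycenter placement follows; the quadratic weight falloff is an illustrative choice standing in for the weighting described above:

    #include <vector>

    struct Vec2 { double x, y; };

    // anchors holds the fixed canonical-pathway protein positions; hops[i]
    // is the graph distance (number of interactions) from the new protein
    // to anchor i, with 0 marking an unreachable anchor. Nearby pathway
    // proteins pull harder, so direct interactions dominate the placement.
    Vec2 weightedBarycenter(const std::vector<Vec2>& anchors,
                            const std::vector<int>& hops) {
        double wSum = 0, x = 0, y = 0;
        for (std::size_t i = 0; i < anchors.size(); ++i) {
            if (hops[i] <= 0) continue;  // unreachable pathway protein
            double w = 1.0 / (double(hops[i]) * double(hops[i]));
            x += w * anchors[i].x;
            y += w * anchors[i].y;
            wSum += w;
        }
        if (wSum == 0)          // not connected to the pathway at all;
            return { 0, 0 };    // handled separately (lower-right cluster)
        return { x / wSum, y / wSum };
    }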
3.2.5 Computing Protein Positions

While the canonical pathway proteins have user-provided predefined positions, the system must compute where to put the proteins extracted from the interaction database. These proteins are placed depending on their distance, in terms of number of interactions, from each of the pathway proteins. If protein P interacts directly with protein A and is three interactions away from protein B, it is placed on the line segment between A and B, closer to A. The distances are not necessarily directly proportional to the path lengths: they can be weighted so that direct connections are much shorter than longer interaction paths. Essentially, the nodes are placed at a path-length weighted barycenter of the pathway nodes. Barycenter positioning was also used in [66] to place new nodes in relation to already existing ones in the context of evolving graph drawings. This algorithm produces positions close to those computed by a traditional spring layout algorithm, since a node is dragged by the edge springs to a similar location.

This methodology leads to identical positions for some proteins, however, and a force-directed approach based on [67] is used to perturb the layout and remove overlaps; a simple linear grid approach improves the performance of the layout algorithm by using vicinities to reduce the number of comparisons needed to compute forces on protein nodes. The sizes of nodes are taken into consideration when computing repulsive forces. The aspect ratio of nodes in relation to the force vectors can also be taken into account so that forces are applied anisotropically. This leads to slightly longer run times but minimizes overlap, especially in augmented pathway images where some nodes can be much larger in one direction. As a special case, positions cannot be computed for proteins linked only to the experimental data and not to the known pathway. These are placed in the lower right side of the display, yielding a cluster of proteins that are not known to be connected to the pathway.

This algorithm is relatively fast, interactive, and achieves the desired results without the complexities of more powerful constraint-based techniques such as [56]. The layouts in Figure 3.3 took around two minutes to compute. We also experimented with simulated annealing methods; these, however, were much slower and did not improve the layouts significantly due to the high network density. Some parameters inherent to force-directed methods still require user adjustment.
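The following sketch illustrates the path-length weighted barycenter placement under stated assumptions: `anchors` maps pathway proteins to their fixed positions, `dist[p][a]` holds interaction-path lengths, and the inverse-power weighting is one plausible choice for making direct connections pull much harder than long paths.

```python
# Sketch of path-length weighted barycenter placement for a database
# protein p; pathway anchor proteins keep their user-provided positions.
# Assumes dist[p][a] >= 1, since p is not itself a pathway protein.
def barycenter_position(p, anchors, dist, falloff=2.0):
    wx = wy = wsum = 0.0
    for a, (ax, ay) in anchors.items():
        w = 1.0 / (dist[p][a] ** falloff)  # direct links (distance 1) dominate
        wx += w * ax
        wy += w * ay
        wsum += w
    return (wx / wsum, wy / wsum)
```

Proteins with identical anchor distances land on identical positions, which is why the force-directed perturbation pass described above is still required.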
3.2.6 Augmenting a Pathway Image with Dynamic Data

The case of specifying a pathway image and integrating dynamic information seamlessly into the already existing representation is more complicated than assembling a completely new visualization. Simply drawing the database-extracted elements on top of the pathway image has several disadvantages, as shown in the cutout of Figure 3.3. In contrast, the method presented in this section creates the illusion that the proteins and interactions drawn dynamically are part of the pathway image (see Figure 3.3, left).

The following specialized operations are used to create the illusion that the HPRD proteins and interactions are part of the pathway image. The shapes and locations of proteins and interactions in the image are computed in the image preprocessing step. They are then used in the layout stage to minimize overlap (dynamically loaded proteins tend to move to "empty" image areas). Finally, they are copied from the image and redrawn as masks on top of the final network. This technique ensures that the pathway model stays on top of the dynamic network and gives the illusion that the canonical pathway representation and the dynamic network coexist and interact (see Figure 3.3, left).

3.2.7 Exploring the Network

In our design the interaction network can be explored at two levels simultaneously: at a global level, where the signaling pathway and other high-level structures are evident, and at a local level, where only one protein and its neighbors appear in detail as the researcher jumps from protein to protein in the network. The two types of visualization coexist as two parallel planes, the local one gliding above the global one (Figure 3.1). With these complementary views of the pathway space, the user explores the network in the detailed space that is rich in focused protein information while maintaining an overview of the explored area and orienting the expanded exploration to his or her location within the global view.

Exploration is done in a plane that hovers above the global view and shows in detail only one protein and its interactors. Initial access to the exploration plane is obtained by double-clicking proteins in the global view, in the experimental lists, or in a list of all proteins present in the visualization. While in exploration view, clicking one of the interactors shifts the center of the view to the selected protein, a change performed through smooth animation to maintain context understanding. Standard zooming and panning using mouse controls are also available, but testing has found them less favored by users. Proteins in the exploration plane are arranged so as to mimic their placement in the global layer while satisfying aesthetic criteria such as minimum distances between proteins or interaction overlap (Figure 3.4, left). The effect is achieved by applying a simulated annealing [46] algorithm that attempts to maximize layout similarity while ensuring a pleasing drawing.
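A minimal simulated-annealing sketch in this spirit appears below; the `energy` function, which would penalize deviation from the global positions plus protein overlap, and all other names are illustrative assumptions rather than the prototype's implementation.

```python
import math
import random

# Sketch: anneal a list of (x, y) positions against a user-supplied
# energy(layout) scoring layout-similarity and overlap penalties.
def anneal(layout, energy, steps=5000, t0=1.0):
    current = energy(layout)
    for step in range(steps):
        t = max(t0 * (1 - step / steps), 1e-9)      # linear cooling schedule
        i = random.randrange(len(layout))
        old = layout[i]
        layout[i] = (old[0] + random.uniform(-5, 5),
                     old[1] + random.uniform(-5, 5))
        candidate = energy(layout)
        # Accept improvements always; accept worse moves while still hot.
        if candidate < current or random.random() < math.exp((current - candidate) / t):
            current = candidate
        else:
            layout[i] = old                          # revert the move
    return layout
```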
The area allocated to the exploration view is computed dynamically on the basis of the number of proteins to be displayed. A view that places the main protein in the center and its interactors circularly around it is also provided. Clicking a protein in the exploration view highlights it and its neighbors in the lower plane, making it easier for the user to establish a correspondence between the two.

3.2.8 Visualization Prototype

A compact set of features was added to the testbed system, allowing our researchers to operate on the network data and pose visual queries. For instance, selectors and the ability to adjust appearance allow the researcher to highlight interesting aspects of the visualization. In the right panel of Figure 3.3, a user has selected various groups or classes of proteins and attached to them special visual attributes such as shape and color, a technique often used in stylized signaling pathway representations. The method described in [121] is used to highlight interactions of one or more selected proteins. Easily extensible filters allow a researcher to remove proteins deemed uninteresting. One potentially useful filter with significant effects keeps only proteins that connect a set of user-selected proteins.

3.2.9 Implementation Details

The prototype application was written in C++. The G3D 6.7 graphics library was used for 3D graphics and rendering, and the Qt 4.3 library for user interface elements. The HPRD database can be downloaded as flat files together with the application.

3.3 Evaluation and Findings

The results of this work are findings about ways to improve analysis of protein interaction networks and quantitative proteomic data, and novel visualization and exploration paradigms motivated by these findings. The research presented in this chapter was driven and validated by insights obtained during our collaborative development process and by an anecdotal evaluation with domain experts. Results indicate that applying these concepts in the context of systems for visualizing protein-protein interaction networks may accelerate the discovery of new connections among quantitative proteomic data, interacting proteins, and canonical signaling pathways. While a controlled study may still be needed to verify and quantify the benefits of individual aspects of our methods, anecdotal evaluation with domain experts is a preferable approach in an iterative design setting with no predefined requirements, since it can provide fast, easy access to usability information on high-level analysis tasks.

Evaluation was performed on the analysis of phosphoproteomic experiments with the help of four proteomic researchers studying T cells and mast cells. These experts artificially stimulate cells and measure the amount of phosphorylation that occurs on proteins as a result. Phosphorylation is an important cellular process by which a phosphate is added to a protein or other molecule. A protein can be phosphorylated in multiple places, called phosphorylation sites. In a single-experiment setting, phosphorylation measurements over multiple time points can provide causality hints. More importantly, however, researchers can run separate experiments before and after inhibiting an investigated protein. By comparing changes in measured phosphorylation values they can hypothesize about the role of the investigated protein in the cellular pathway.

Finding 1: Visually combining experimental data and known protein interactions enhances analysis

Results presented in this section augment previous results from [17] and [148] with similar findings in a different analysis setting. Specifically, results suggest that coupling new experimental data with protein interaction data extracted from public databases within a unified visual analysis can shorten the analysis of a new experimental dataset from weeks to days. In addition to the straightforward time gain, shorter time intervals between individual data observations let researchers integrate them more efficiently into a cohesive hypothesis.

By using the prototype, one of the collaborators who took part in the evaluation quickly discovered a meaningful biological fact that had eluded her in previous analyses of a T-cell related phosphoproteomic dataset. She started by browsing through the list of experimentally measured proteins, displayed as seen in Figure 3.1 on the right-hand side of our prototype. She then decided to take a closer look at the protein SLP76, because of the variation reflected by its heatmap. Double-clicking on the list item opened a detailed exploration view, as shown in Figure 3.1. The visualization revealed that this protein was known to interact with the protein VAV. Metadata available within the software then revealed that the particular measurement could indeed be related to that specific interaction. In addition, a novel phosphorylation site was detected on SHP1. An interaction with SLP76 and metadata about this interaction were easily accessible in the software and led to the hypothesis that SHP1 negatively regulates SLP76. These insights might eventually have been produced using the collaborators' previous strategy of manually querying each experimentally measured protein and gathering information about them. The integration of experimental data and the protein interaction network reduced the time needed to make this discovery.

Finding 2: Canonical pathway-driven layout is intuitive for proteomic researchers

Structuring dynamically extracted protein interactions around a familiar canonical pathway (see Figure 3.1) provides an intuitive visualization that helps proteomic researchers orient themselves and learn the interaction network quickly. A proteomic experiment revealing hundreds to thousands of protein modification sites overwhelms users with many unfamiliar proteins. Becoming familiar with the proteins in such an experiment is greatly facilitated by placing those proteins within signaling pathway-structured protein interaction networks.

This pathway-structured method was motivated by negative feedback on an initial prototype that used a standard force-directed network layout. This feedback suggested that generic network-drawing algorithms fail to place proteins in positions that are meaningful from either a biological or a pathway-conventions standpoint (receptors can end up near the nucleus). Moreover, proteomic researchers were overwhelmed by the unstructured node-link diagrams such methods produce and tried to map the new visualization to the signaling pathway they were using before. This was also found by [17] and [98] to be an important issue in systems that employ traditional graph drawing algorithms to display protein interaction networks.
The work presented here differs from theirs by introducing a novel visualization paradigm to address this problem. In a broader visualization context, integrating dynamic connectivity information into static diagrams is a potentially useful concept because it facilitates the integration of new information into existing thinking schemas. We demonstrate its perceived usefulness in a proteomic context. More targeted research is needed to establish whether the perceived benefits translate to actual task improvement, and to identify other areas of application.

Figure 3.4: Exploration plane versus zoom-and-pan. (left) The network is explored in a separate plane showing only one protein and its interactors. Selecting an interactor changes the view to that particular protein via a smooth animation. This interaction network crawling method allows systematic discovery of connections among proteomic data and existing protein knowledge. Transparency keeps the global view visible, and the same protein is highlighted within both planes. The protein layout in the exploration plane mimics the layout in the global plane, but is slightly distorted to achieve a more attractive representation. Changes in peptide abundance are represented as linear heatmaps. (right) Zooming and panning, while also available to explore the network, have several drawbacks: the view is cluttered, some interactors reach outside the viewing area, there is no space for additional details, and the global perspective is lost.

Evaluation

Overall, the experts preferred this layout method over two network visualizations they had tried before: Cytoscape [148] and an earlier interaction network prototype. At the time of their use, both systems used traditional graph drawing algorithms and were criticized for their lack of structure. Conclusions drawn from specific user comments were: the familiar pathway model that seeds the exploration is visually appealing and reduces the initial ball-of-strings shock associated with most network visualizations; it helps users orient themselves by providing a familiar context; and it gives protein placements more meaning and ensures that well-known proteins are placed in familiar locations.

A problem identified in early testing was that growing the pathway from the user-specified proteins alone omitted many experimentally observed proteins from the network, due to the lack of connections between these proteins and the known pathway within the protein interaction database. This problem was addressed by growing the network not only from the pathway proteins but also from the proteins indicated by the experiment, thus ensuring their inclusion within the network. However, some experimental proteins will still not be connected. These proteins are placed in the lower right corner of the representation, essentially forming an island of proteins revealed in the experiment but not known to be connected to the user-provided signaling pathway skeleton. This approach has its benefits, as one test case revealed. After loading a large phosphoproteomic dataset onto the well-established insulin pathway, a user immediately observed that many of the experimental phosphoproteins were connected to the signaling pathway, while the "island" of unconnected proteins was fairly small. This increased the user's confidence in both the experimental results and the visualization.
Finding 3: Global and local exploration modes (multilayer, multiscale views)

The evaluation shows that researchers prefer to explore an interaction network by using a local view of each protein, looking at only its direct interactors at a time (Figure 3.4, left). This initial hypothesis guided the design choices and was validated during evaluation. During the testing stages much time was spent in the local view instead of the global view. This finding suggests that protein network analysis benefits from views that isolate one protein and its interactions from the rest of the network. Current interaction network visualization frameworks lack Focus+Context capabilities, and little research exists to address this issue.

Evaluation

The evaluation revealed that the exploration plane was indeed the most popular mode of protein-network exploration. A second demonstration and usage session with a separate proteomic group led to the same conclusion. The global view was used to apply filters, browse through the data, and jump-start exploration. It also created an important first impression of the visualization as a whole and kept the users engaged. For reasoning about connectivity, however, researchers rarely looked directly at interactions in the global view, even though zooming and panning were available. The navigation plane was used instead.

Observations of proteomic workflows during development and evaluations suggest that current proteomic analysis happens mostly at the interaction level. This explains why our Focus+Context method was preferred over traditional global exploration: a single protein and its interactors can be viewed without clutter from any other network elements; all interactors are visible at once without panning; the space can be distorted to make room for additional glyphs and information associated with the proteins; and both views – global and local – are visible at the same time, with an emphasis on the local view. Given the synergy between local and global viewing, with a stronger emphasis on local exploration for accurate analysis tasks, the method presented in this chapter appears to be adequate. The local view is in focus, while the entire global view is maintained in the background as a mental anchor. The user can switch between views immediately using an intuitive operation that requires minimal mental transformations.

Probably the main contribution of this result is showing that techniques for exploring networks concurrently at varying degrees of detail are suited to proteomic analysis tasks and should be included in specialized systems. While the presented novel technique works well in this domain, other Overview+Detail paradigms, such as the ones described in our related work section, may also produce good results. We note that this research used unfiltered interactions taken directly from proteomic databases, which resulted in dense networks. Curating the interactions that are placed in the pathway could allow all information to be visible at the same time, as seen in some networks presented in [17]. In that case zooming and panning may be sufficient for interacting with the network.

Finding 4: Comparative displays of multiple experiments help identify important pathway players

Test cases showed that the ability to load and compare multiple experimental results, for example from cells containing deleted or mutated proteins, helped researchers link cell behavior to experimental results.
Also, researchers found it useful to have the experimental data permanently visible to drive the exploration.

Evaluation

The first prototype that was developed did not present the user with an explicit list of experimental proteins; instead, they were marked on the network. Users argued that they preferred to be able to go through their experimentally derived proteins systematically, preferably in a list. This led to the addition of an experimental proteins list in one of the system's submenus. Further testing showed that users referred to that list throughout their analysis. The conclusion was that having it permanently displayed and linked to all the views would speed up their analysis process. In the final evaluation, the typical analysis workflow consisted of systematically going through the experimental protein list, selecting proteins with interesting patterns as suggested by their heatmaps, and opening them in local exploration.

The following test scenario showed the usefulness of this approach: in an experiment, the known T-cell signaling protein ZAP70 was removed from the cell, and quantitative phosphoproteomic perturbations were recorded before and after the removal. The user started his analysis by examining the heatmaps indicating the fold changes between the two experiments. The heatmap profile signaled an interesting change on the Lck protein, an upstream component of the pathway: the phosphorylation of Lck was greatly delayed when the downstream protein ZAP70 was removed. By bringing up Lck in the exploration plane, a direct interaction was discovered that connected Lck to ZAP70 and explained the change.

3.4 Discussion

3.4.1 General Considerations

Maintaining a tight collaboration between researchers from computer science and proteomics led to a better understanding of the requirements and specifications of proteomic visualizations. The canonical pathway-driven network layout and experimental data-guided network exploration are tangible results of such a collaboration.

Good proteomic visualizations should support and automate part of proteomic researchers' data analysis workflows. But identifying these workflows is nontrivial, and they often vary among individual labs and researchers. The novelty of experimental data and constantly evolving proteomic methodologies make it hard for the researchers themselves to describe their workflows clearly. However, the process of workflow discovery, while laborious for both proteomic and computer science researchers, is beneficial for both parties since it identifies where computers can help most.

3.4.2 Layout

One drawback of the canonical pathway-guided layout is the overhead associated with specifying the canonical signaling pathway within the software. The most laborious step is not so much inputting the structure as searching for correct protein identifiers in the interaction database; this can be time-consuming due to naming ambiguities, multiple matches, missing proteins, and inconsistencies across protein databases. Initial testing revealed that correctly identifying canonical signaling pathway proteins within the protein database is aided by additional cues and metadata such as the number of interactions or interacting partners. The average time required by users to input the pathway skeleton and attach database identifiers was around 20 minutes for medium-sized pathways like those shown in the figures here. This overhead is acceptable if one considers that researchers commonly spend months or years studying a few pathways.
Moreover, a canonical signaling pathway skeleton, once constructed, can be used to build multiple networks for different proteomic experiments and parameters.

Proteins imported from databases are by default displayed smaller than pathway proteins and are sometimes not legible without zooming. This mode of display was motivated by the desire to keep the pathway structure in the foreground and by the need to save canvas space and minimize overlaps in dense protein networks. However, the default size settings are adjustable, and users can customize them for individual classes of proteins.

Another issue related to protein glyphs is that proteomic researchers often place several icons corresponding to the same protein in various places on the canvas, usually depending on the specific function and context. Many-to-one correspondences between graphical icons and data entities are uncommon in network representations. The testbed application allows this type of representation by automatically adding numbered suffixes to identical proteins to differentiate them. However, extensive use of this feature tends to clutter the representation with redundant information, since interactions are replicated for each copy of the same protein.

Augmenting a pathway image with dynamic data does not always work. While the application is designed to accept any type of image, low image quality or high complexity can render the system unable to extract the pathway structure. The feature detection algorithm is flexible and can automatically adjust its parameters based on user feedback. However, the image-processing techniques were not the focus of this research and are not state of the art.

3.4.3 Focus and Context Exploration

The local exploration plane received positive feedback and was used extensively during evaluation. Its simplicity is both an advantage and a limitation. Users can easily understand what the display is showing and how to crawl around the network, while the visualization avoids clutter and in most cases does not require zooming or panning. Showing a single network level, however, can make it difficult to determine the optimal direction for future exploration. Unfortunately, real-life uncurated protein networks have high node degrees that limit the number of levels we can show without clutter. Possible solutions to this problem are hyperbolic views, automatically adjusting the number of levels that can be displayed without clutter, or attaching glyphs to nodes that provide cues about interesting exploration directions.

The decision to place the exploration plane on top of the global view rather than in a separate window was primarily motivated by the desire to save screen real estate. This choice has the disadvantage of occlusion, but we believe this is outweighed by the ability to use the entire display area for exploration while preserving a view of the global layout in the background. This situation arises frequently in protein interaction networks since many proteins are highly connected and need large display areas. The area assigned to the exploration plane is computed dynamically depending on the number of proteins to be displayed, thus minimizing occlusion as much as possible. The transparency of the exploration plane is also adjustable. The current proliferation of screen real estate, even in common analysis settings, opens the way to placing the two views next to each other.
This approach would remove the occlusion problem, but the need to frequently shift focus across views and to spatially relate elements across the two views might introduce an additional cognitive cost. Observations of the users confirmed our design choice: the global view was used mainly as a visual reference, especially for large networks, and as support for posing visual queries using selectors and filters. These tasks are not significantly hindered by occlusion. Proteins and interactions were rarely looked at closely in the global view, a task that occlusion would affect more.

It is also possible that applying filtering criteria to protein interactions would lead to sparser, more relevant networks like those featured in [17]. These could then be fully legible and explorable in a global view, potentially minimizing the need for a separate exploration view. However, the domain experts who took part in the evaluation have not identified, in the biological databases they currently use, any such criteria that can be applied automatically.

The placement of interacting proteins within the upper expanded view plane is designed to mimic the placement within the global lower plane while preserving aesthetic criteria such as minimal node overlap. In addition to highlighting in the global view the interacting proteins that are being explored, this allows the user to better relate the exploration views to the global view. The view during exploration can be either tilted or parallel to the view plane. An in-depth analysis of the benefits of each type of projection was out of the scope of this work. However, several users expressed a strong preference for the tilted view; this preference can likely be attributed to the superior visual appeal of a 3D representation rather than to a perceptual benefit. Negative comments about distortions caused by the perspective projection seem to support this hypothesis.

3.5 Concluding Remarks

This chapter introduced several novel visualization methods and paradigms for the analysis and quantitative comparison of multiple proteomic data sets in the context of published protein-protein interaction networks and known signaling pathways. The effectiveness of the methods was evaluated in terms of data insights, hypothesis generation, and improvements in analysis time. Specifically, we showed that tightly coupling known protein interaction information with new experimental data, scaffolding protein interaction information around familiar signaling pathway models, and supporting exploration at varying degrees of detail can increase adoption rates among proteomic researchers and accelerate knowledge extraction from massive quantitative proteomic datasets.

Chapter 4

A Map Inspired Framework for Accessible Data Visualization and Analysis

Visualization of biological data can be thought of as lying on a continuum. At one end, database-driven websites provide sparse representations of small, manageable bits of data. At the other end, complex stand-alone visualization systems offer many different visualization options and analysis features. Both approaches have merit and are widely used, but both have task-specific limitations. In terms of usability, the former have low visual expressivity and usually do not incorporate large data sets or complex computations, while the latter carry significant overhead associated with setting up and learning to control the environments.
In our experience, most biology researchers use one or two established analysis environments but are generally unable to invest time in learning new, experimental visualization tools. From a dissemination standpoint, scientists producing data lack the expertise required to set up and maintain a database-driven website. Finally, turning a prototype into a usable system can be a daunting task for visualization researchers due to the high costs and low benefits of GUI refactoring, automatic parameter tuning, and creation of user manuals.

In this context the chapter introduces scientific data maps: pre-rendered visualizations of most of the data tied to a subdomain or scientific problem. They are served over the web, possibly through the Google Maps API, and have a simple and intuitive set of interactions that can be learned with minimal overhead. An evaluation with domain experts shows that this approach is a viable solution for specific users and tasks and provides advantages from both ends of the visualization continuum while limiting many of the drawbacks mentioned above.

The key differences between traditional approaches and scientific data maps are as follows. Instead of the data-query-specification/recompute paradigm, maps contain all of a user's data or the views derived from them; data query is thus done through zooming and panning during visualization. Traditionally, it is the end user's job to construct a visualization (query specification and parameter definition), while maps are built by visualization experts or bioinformatics staff in larger labs. Finally, the goal of visualization systems is to give users complex functionality that answers a large array of questions; maps, on the other hand, aim primarily to provide fast, intuitive access to visual data, so their functionality is balanced with a sparse set of interactions, close to what is available in regular Google Maps.

Figure 4.1: Five examples of digital map visualization (from left to right): gene co-expression and heatmap representations, a genome viewer, a protein interaction network, and a brain tractography projection.

Advantages of data maps occur on both the user and the visualization researcher sides. For users, including scientists browsing and analyzing data as well as those producing data, visualizations become easy to access, learning time is significantly reduced, users worry only about the data, and disseminating visual results is simplified. On the side of the visualization researcher, fast prototyping can be used for the rendering application with little effort invested in interfaces; there is no concern for computation and rendering time; visualizations are easy to distribute to both test and end users; and powerful synergies with web-based libraries, such as Protovis, can be produced by creating focus+context explorations in which the map provides context.

The motivation behind the work is to let labs publish data and results in visual form along with raw textual data, so that users can access readily analyzable perspectives on the data without additional overhead. Specifically, the work is driven by the Immgen project [4], a collaborative effort aimed at generating a complete microarray dissection of gene expression in the immunological system of the mouse. The data-map concept enables the dissemination of the project's microarray data as precomputed visualizations that can be accessed on the project website.
However, we show how the map concept is general enough to also be applied to the two application areas presented in Chapters 2 and 3. The chapter is structured around five specific visualizations implemented as maps: a genome viewer, a 2D embedding and heatmap of gene expression data, a protein-interaction network, and a white-matter tractography map. All have been evaluated informally with domain experts, and two have been deployed and are in use on the Immgen website.

4.1 Related Work

Here we discuss existing approaches that are relevant to our work. The section starts with a discussion of web-based visualization, continues with an exposition of the Google Maps API and a presentation of visualization systems for biological data, and ends with techniques relating to each of the five specific examples.

4.1.1 Web Based Visualization

Data visualization has been available on the web for many years but has usually displayed a limited amount of data using basic graphics and interaction. More recently visualization research has started to target this environment, and advanced applications have emerged. ManyEyes [169] paved the way for everyday data visualization, while other studies [168, 43] demonstrate the need for accessible web visualization.

While web-development toolkits such as Protovis [24] greatly aid web development, large-scale web visualization is hampered by browsers' inherent limitations [96]. Alternatively, stand-alone systems have been made available as applets or as client applications run directly from websites [107, 148]. However, users still have to control the parameters involved in producing visualizations, specify their data queries, and learn system features. This often constitutes an undesired overhead. Yet another approach, more similar to our work from an implementation standpoint, is to use Ajax (asynchronous JavaScript and XML) technology to do the rendering on the server side and serve images asynchronously to the client browser. A specific call for Ajax-based applications in bioinformatics is made in [12], while [21] and [75] exemplify this approach. The essential difference between this work and traditional offline visualization systems is that control and display happen in a separate place from rendering and computation.

The methods presented in this chapter differ by attempting to limit regular users' effort in creating visualizations and assigning this task to experienced personnel, by introducing large visualizations that contain most of the data associated with a problem, and by using the Google Maps API, a readily available Ajax implementation of pre-rendered images. Closest to this methodology are X:Map [179] and Genome Projector [11], which present implementations of genome browsers using the Google Maps API. This idea is expanded here to a broader visualization context: visualization solutions are introduced for four specific examples, and an evaluation is presented both of the preference for Google Maps powered visualizations in general and of the four specific visualization examples.

4.1.2 Google Maps

We use the Google Maps API, an Ajax framework used to render large maps, to display our visualizations. It receives as input image data in the form of a set of small images, called tiles, that when assembled together form the different zoom levels of the map. Each zoom level consists of a rectangular grid of $2^{zoom} \times 2^{zoom}$ tiles. The API decodes the zoom level and coordinates of the currently viewed map region to retrieve and display the visible tiles. The developer can load a custom set of tiles into the API by implementing a callback function that translates numerical tile coordinates and a zoom level into unique paths to the custom tiles.
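The sketch below mirrors that callback logic in Python; the tile directory layout ("tiles/{zoom}/{x}_{y}.png") is an assumption for illustration, not the layout used by the dissertation's implementation.

```python
# Sketch: translate numerical tile coordinates and a zoom level into a
# unique path to a pre-rendered tile image (assumed naming scheme).
def tile_path(x: int, y: int, zoom: int) -> str:
    n = 2 ** zoom                       # tiles per side at this zoom level
    if not (0 <= x < n and 0 <= y < n):
        raise ValueError("tile coordinates outside the map")
    return f"tiles/{zoom}/{x}_{y}.png"
```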
The API provides basic functionality such as zooming and panning and allows programmatic extension or customization with markers and polyline overlays, information pop-ups, and event management. The API can be easily integrated into any Javascript-powered web page. Our visualization development environment was extended with the option to render any of our visualizations into a set of image tiles instead of the screen, for a specified number of zoom levels. In addition, each visualization must export auxiliary data to be used for interactive purposes (e.g., coordinates of genes for gene selection).

4.1.3 Biomedical Visualization

Many advanced systems for biological data analysis have been developed over the past decade. Examples targeting microarray expression data include free software packages such as Clusterview [60], TimeSearcher [80], and Hierarchical Clustering Explorer (HCE) [147], as well as commercial systems such as Spotfire [3] and GeneSpring [2]. GenePattern [107] is a broad effort aiming to facilitate the integration of heterogeneous modules and data into a unitary, web-managed framework for microarray data analysis. Similarly, several tools exist for pathway and network analysis: Cytoscape [148], VisANT [82], Ingenuity [5], and Patika [47], all of which provide features aimed at complex analysis of microarray data or pathways.

The goal of the work presented in this chapter is to offer no-overhead visualizations that will be used primarily for casual data exploration by users unable to spend time learning advanced systems. In that regard, this work comes closer to applications providing primarily look-up functionality, such as tools published on the NCBI website or the genome browser at UCSC [102]. In contrast to these efforts, the data maps aim to provide visualizations that include more computation and visual cues and less complicated query specifications.

4.1.4 Multidimensional Scaling

A more complete discussion of multidimensional scaling is presented in Chapter 2. The work here uses the same algorithm with linear iteration time proposed by Chalmers [34]. Here, however, this algorithm is combined with elements from HiPP [130], an algorithm that uses a hierarchical clustering to drive a 2D embedding.

4.1.5 Genome Browsers

Genome viewers are used to explore genome structure from the chromosome down to the sequence level. They can be used to browse the structure of genes, to investigate whether function is linked to genomic location, and to understand genomic conservation or alteration. Many implementations of genome browsers exist, ranging from basic web-based ones [156, 102, 153] to more advanced browsers integrated into complex analysis systems [113, 152, 45]. For visual display most genome viewers lay out chromosomes linearly, with some exceptions such as Circos [1] and Mizbee [113] that display them radially. The approach presented here comes closest to X:Map [179] and Genome Projector [11], which use the Google Maps API to display precomputed images. The genome viewer presented here differs in relying on a visual mapping of expression data onto the genome and on full genome views to drive exploration, and in presenting results obtained from evaluation.
4.1.6 Graphs and Protein Interaction Networks

A comprehensive discussion of networks appears in Chapter 3. However, the work presented here relies on two concepts that were not previously discussed; the following two paragraphs introduce them.

Most existing protein-interaction visualizations handle only small networks, making the selection of relevant sub-networks essential. However, clear guidelines for this task do not exist, and simply selecting nodes within some separation leads to exponential growth because of the typically high degree of such biological networks. A solution to this problem was proposed in van Ham's work [167], which uses the degree-of-interest (DOI) concept introduced by Furnas [69] to select meaningful graph regions. Similar computations are used in one of the map implementations presented in this chapter to achieve a zoom-based filter.

To reduce clutter in the network, Eades and de Mendonca's vertex-splitting operation [58] is used. This proposes that nodes which exhibit high tension due to the layout forces acting on them should be split into multiple copies. There is little work on using vertex splitting for drawing graphs. Henry et al. [78] use vertex splitting in visualizing social networks; however, their method is not applicable to proteomic networks, which are in most cases unsuited for clustering.

4.2 Design Elements

This section starts with five specific visualizations that exemplify the scientific data map approach. It ends with a distillation of design guidelines and methods for creating visualizations such as the ones presented in this chapter.

Figure 4.2: Co-expression map of 23k genes over 24 cell types of the B-cell family exemplifies the map concept. The top view illustrates how maps are combined with client-side graphics: the map is at the center of the display, while selecting genes by drawing an enclosing rectangle generates a heatmap on the right. Maps have multiple levels of zooming (bottom two rows), each with a potentially different representation. For example, genes are drawn as heatmap glyphs at high zoom (lower right) and as dots at low zoom. Expression profiles of collocated genes are aggregated and displayed as yellow glyphs over the map. As zoom increases, expression profiles are computed for increasingly smaller regions. Interactions are not limited to zooming and panning; pop-up boxes link out to extra data sources, and selections of genes bring up a heatmap (top panel).

4.2.1 Example 1: Gene Co-Expression Map

Description and usage scenario: Given genes with expression measurements over multiple biological conditions, we construct a 2D map where genes are placed so that their proximity reflects the similarity of their expression profiles. Scientists can use the T-cell co-regulation map in Figure 4.2 to find other genes that co-regulate with genes of interest and to understand how these genes co-regulate given the set of conditions described by the map. Immunologists can browse co-regulation maps to understand expression patterns in the featured conditions. Finally, scientists interested in downloading unfamiliar data can perform a preliminary investigation using maps hosted on the project website.

Our embedding algorithm was inspired by HiPP [130] but employs a different layout technique. As in HiPP, we use bisecting k-means to create a hierarchical clustering of the data. We then compute the clustering distance of two genes as the length of the path between their nodes in the clustering tree. We multiply this distance by the Euclidean distance between the genes in the high-dimensional space described by the biological conditions. Finally, we use Chalmers' embedding [34] to project this combined distance in 2D.
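The combined dissimilarity could be computed as in the sketch below; `tree.path_length` is a hypothetical helper for the hop count between two leaves of the clustering tree, and all names are illustrative.

```python
import math

# Sketch: combined dissimilarity for the co-expression embedding --
# clustering-tree path length multiplied by expression-space distance.
def combined_distance(g1, g2, expr, tree):
    euclid = math.dist(expr[g1], expr[g2])   # distance over the conditions
    hops = tree.path_length(g1, g2)          # leaf-to-leaf path (assumed API)
    return hops * euclid
```

A Chalmers-style spring embedding over this measure then yields the 2D layout; the discrete tree component is what produces the visible cluster demarcations discussed next.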
The discrete component introduced by the clustering tree is responsible for the clear demarcations between clusters observable in Figure 4.2. We initially used a standard projection, but user feedback indicated that the lack of visible clusters detracted from analysis. Users considered the modified version preferable even when made aware that the cluster boundaries were introduced artificially.

In rendering, glyphs are drawn over map regions, showing the aggregated expression profile of the genes in that particular region along with the standard deviation. The size of the aggregated regions is zoom-dependent; as the zoom level increases, averages are computed over smaller gene sets for increased accuracy. This is achieved by linking zoom to cluster-cutting of a hierarchical clustering of the 2D projected distances. At detailed zoom levels, genes are represented by heatmap glyphs that color-code the expression value of that gene at each condition, giving users access to individual data values. The color scheme chosen was blue-green-yellow-red, to maximize the perceived expression difference.

For the Google Maps implementation, the visualization was rendered to tiles, gene positions were exported to a text file, and gene expression values were encoded as one-byte values to limit file size. These elements are used in the Javascript + Google Maps + Protovis map implementation in Figure 4.2. Users can search for a gene and highlight it via a marker. They can also select a group of genes by drawing a selection rectangle. If the selection is small enough (at most 100 genes in our implementation), a heatmap representation is rendered using the Protovis library. The list of selected genes can be exported for further analysis.

4.2.2 Example 2: Gene Expression Heatmaps

Description and usage scenario: Given a list of genes, each with multiple expression measurements corresponding to a set of biological conditions, a rectangular heatmap representation is constructed in which each row corresponds to a gene, each column to a condition, and each cell is a color-coded expression value. Rows and columns are arranged so that co-regulated genes and conditions are placed together. Scientists interested in T-cells can access a number of heatmaps corresponding to different types of genes to understand regulation patterns.

This map (see Figure 4.3) exemplifies a low-cost map implementation. The R statistical environment was used to generate a heatmap clustered on both genes and conditions. Text files with the genes and conditions in the ordering occurring on the heatmap were exported. The heatmap image was split into tiles and used to generate a Google map. Protovis was used to attach axes with gene and condition labels to the right and bottom sides of the map. These axes are synchronized to the map's zoom and pan operations so that labels for the currently viewed regions of the heatmap are always within view.

Figure 4.3: A heatmap representation is displayed as a map, with gene and cell type axes implemented in Protovis attached on the right and at the bottom. The axes are linked to the map's zooming and panning so that users can identify which genes and cells they are looking at. Selection of an area of interest prompts the highlighting of the corresponding cell types and genes.
Users can select a region on the map and prompt the highlighting of the corresponding genes and samples.

4.2.3 Example 3: Genome Map

Description and usage scenario: Given expression values over a set of conditions for any gene, we create color-coded expression glyphs at genes' genomic coordinates. Scientists can use this map to analyze connections between gene function and genomic location and can identify co-located genes that exhibit similar expression. Such maps can also be used to query and highlight regions of the genome that are enriched in genes belonging to particular classes, defined either by expression (e.g., genes that have higher expression in condition 1 relative to condition 2) or by functional category (e.g., all tyrosine phosphatases).

Gene expression is mapped to a blue-green-yellow-red color scheme. Glyphs color-coding expression values in every condition are created for each gene; a gene-name label is included. The 21 mouse chromosomes are stacked vertically, each extending horizontally. Following user feedback, no space warping or distortions, such as in [1, 113], have been used. The expression glyphs are mapped onto this space based on gene location. We use no aggregation of expression across zoom levels, because inspection of genomic maps and user feedback indicate that co-located genes most often do not have similar expression patterns.

Genes are not uniformly distributed on chromosomes; instead, regions with high and low gene density alternate. In high-density regions the space available to render a gene, assuming finite zooming, is limited and often insufficient to ensure visibility of the glyph elements. We therefore spread gene glyphs apart while keeping them anchored through a line to their true genomic positions, as seen in Figure 4.4. We use an iterative force method for this purpose.
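A minimal one-dimensional de-overlap pass in this spirit is sketched below; the pairwise push-apart scheme and its parameters are assumptions for illustration, not the actual iterative force method used.

```python
# Sketch: spread glyph positions along a chromosome axis until they are
# at least `min_gap` apart; anchor lines still point to the originals.
def spread_glyphs(positions, min_gap, iterations=50):
    xs = sorted(positions)
    for _ in range(iterations):
        moved = False
        for i in range(len(xs) - 1):
            overlap = min_gap - (xs[i + 1] - xs[i])
            if overlap > 0:              # push the overlapping pair apart
                xs[i] -= overlap / 2
                xs[i + 1] += overlap / 2
                moved = True
        if not moved:
            break
    return xs  # display positions; true genomic positions stay anchored
```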
The visualization is rendered to tiles and gene positions are exported to a text file; these elements are used in a Google Maps implementation. Gene search and highlighting of sets of genes are supported. For the latter, results of queries of the type "genes with expression in condition A higher than expression in condition B" are exported along with the map. Users can select these queries to highlight genes. The highlighting marker is an image with high alpha in the center and fading alpha towards the boundaries, so that the closer two highlighted genes are, the more their markers amplify each other, as seen in Figure 4.4. This ensures that regions with a high density of marked genes stand out even at overview zoom levels. The map makes possible three levels of visual queries depending on the zoom: regions of highlighted genes stand out at the whole-genome perspective; at a slightly zoomed-in level, regions with similar expression stand out by virtue of similar color patterns in gene glyphs; at full zoom, individual gene expression patterns become visible.

Figure 4.4: Gene expression measurements over eight cell types are mapped onto genomic coordinates across the entire mouse genome. The top view shows the general analysis framework as presented on the Immgen website; zoomed-in views appear at the bottom. Three types of visual queries can be performed, depending on the zoom. At an overview, lists of relevant genes can be highlighted using Google markers with custom icons - white lines with alpha gradients on each side marking regions with interesting expression characteristics. At an intermediate zoom (lower left), regions with similar expression can be identified: a blue low-expression region is visible at center right. At a zoomed-in level, individual expression values and gene names can be identified.

4.2.4 Example 4: Protein Interaction Networks

Description and usage scenario: Given proteins and interactions between them obtained from a public database, a node-link map is created (see Figure 4.5). Following a proteomic experiment, a set of active proteins is determined; a large percentage of them are unknown to the scientist in terms of function and interactions. To start the analysis, the scientist loads our map and superposes the experimental proteins. The scientist goes through the experimental proteins, learns their interaction neighborhoods, and determines whether they are candidates for further analysis.

Drawing protein interaction networks as static maps is challenging due to clutter and because related data are not necessarily co-located. Also, because of long edges, zooming may not define a useful data query. To overcome these challenges, we use vertex splitting, and zooming to filter out proteins based on a protein importance measure. These design decisions were subsequently validated during our evaluation.

To ensure co-location of linked proteins and to reduce clutter, we use Eades' [58] vertex splitting with a layout algorithm inspired by [68]. The layout space is interpreted as a rectangular grid, and pair-wise force computations between nodes are restricted to nodes located in neighboring grid cells. Discontinuities at cell boundaries are reduced by fuzzy assignment of nodes to cells: a node close to its geometric cell's boundary has a fair chance of being considered part of a neighboring cell. Once the spring system reaches stability, tensions on nodes determine the opportunity for a vertex split. Given a dividing line running through a node, the force vectors acting on each side of the line are added together and projected on a direction perpendicular to the dividing line. Because of performance constraints our system never reaches perfect equilibrium; thus we take the tension on a node to be the minimum of the two opposing force magnitudes. Multiple division lines are probed to find the maximum tension on a node. The node with maximum tension is split if the tension exceeds a threshold (since our objective is not to planarize the graph). The splitting process involves creating two copies of the node and assigning each edge to one copy or the other, according to which side of the split the force vector it created was on.
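The tension test could look like the following sketch, where `forces` holds the 2D force vectors acting on a node and a handful of candidate dividing directions are probed; the direction count and data layout are assumptions.

```python
import math

# Sketch: probe dividing lines through a node and report the maximum
# split tension, taken as the smaller of the two opposing pulls.
def split_tension(forces, num_directions=8):
    best = 0.0
    for k in range(num_directions):
        angle = math.pi * k / num_directions
        nx, ny = math.cos(angle), math.sin(angle)  # normal of dividing line
        pull_pos = sum(max(fx * nx + fy * ny, 0.0) for fx, fy in forces)
        pull_neg = sum(max(-(fx * nx + fy * ny), 0.0) for fx, fy in forces)
        best = max(best, min(pull_pos, pull_neg))
    return best  # split the node if this exceeds the chosen threshold
```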
To deal with clutter, we chose to prioritize which nodes are shown at overview zoom levels by computing a relevance measure for proteins. As in [167], this relevance measure is computed as a function of a protein's intrinsic relevance and a relevance diffused from neighboring nodes.

Figure 4.5: Analysis of quantitative proteomic data in the context of a protein interaction network. The top panel shows an overview of the analysis setup. Time-course proteomic data is displayed on the lower left. The experimental protein selected in the list is highlighted on the map. A second protein was selected on the map and has its interactors and meta-information displayed. All instances of this protein are listed on the upper left, together with their interactors. Three additional zoom levels are shown on the lower row; as the zoom level increases, less relevant proteins are added to the display.

We alter the diffusion term to avoid elevating the relevance of proteins connected to a highly relevant protein but nothing else, a common situation in satellites of large protein hubs. Given a protein $P$, we first compute breadth-first-search subtrees rooted in $P$'s neighbors, $P$ itself not included. We name the subtree corresponding to neighbor $N_i$ as $N_{P,N_i}$ - the neighborhood of $P$ through $N_i$ - and compute $R(N_{P,N_i})$, the relevance of $N_{P,N_i}$. As in [167], this is the maximum of the intrinsic relevance of each node in $N_{P,N_i}$, weighted by a factor that decays exponentially with increasing distance from the node to $P$. We then consider $P$ as connecting all possible pairs of neighborhoods. We name this the connectivity relevance and compute it for each pair as $R_c = \min\big(R(N_{P,N_i}), R(N_{P,N_j})\big)$. The final diffusion term is the maximum over all $R_c$'s. A protein's intrinsic relevance is computed as a mix of the following: protein degree, occurrence in a specific pathway (e.g., T-cell), and occurrence in experimental data sets obtained from our collaborators.

The relevance score is used to place proteins in bins, much like the city-versus-town distinction in a map analogy. Proteins are first sorted in descending order of relevance. A number of levels for the visualization is decided, five in our examples. Each bin $i$ then receives a contiguous set of proteins from the ordered list. The layout is performed in stages, one for each bin. The most relevant proteins are laid out first, and their positions are then frozen before the second bin of proteins is placed on the map. The discrete nature of the approach makes the layout suboptimal, since higher-level layouts are not aware of lower-level graph topology. To alleviate this problem, we allow two or more levels to coexist while a single one is current. Network elements in non-current levels exert less force than those in the current level and are less likely to be split. In a sense, then, non-current levels provide guidance for the current-level layout.

At rendering, nodes are displayed only if their bin index is lower than a threshold based on the current zoom. Node sizes are adjusted by zoom level to reflect differences in relevance while preserving a sense of uniform scale throughout different zooms. The visualization is then rendered to image tiles.
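The zoom-linked filter could be expressed as in the sketch below; the mapping from zoom level to bin threshold is an assumption, and proteins are assumed pre-sorted by descending relevance.

```python
# Sketch: show only proteins whose relevance bin clears the threshold
# implied by the current zoom level (more zoom reveals more bins).
def visible_proteins(proteins_by_relevance, zoom, num_bins=5, max_zoom=8):
    threshold = min(num_bins, 1 + zoom * num_bins // max_zoom)
    bin_size = max(1, len(proteins_by_relevance) // num_bins)
    return proteins_by_relevance[: threshold * bin_size]
```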
To facilitate node selection, we export for each tile a corresponding text file containing the bounding boxes of the proteins, or parts of proteins, that appear in the tile. Upon a mouse click on the map, the tile contents file is retrieved and the information it contains is used to check for intersections with proteins. For edges, each protein points to a file containing the endpoint coordinates of its interactions. This implementation conforms to the Google Maps architecture and avoids the loading and client-side storage of large data files. As shown in Figure 4.5, we use polyline overlays to achieve node selection using the constellation technique [121], information pop-ups to display protein metadata, and markers to highlight proteins of interest such as experimentally derived ones. To navigate among different copies of the same protein, a window on the side of the map lists all protein copies and their interacting proteins; clicking on a line in the list causes the display to jump to the corresponding map location. Time-course proteomic experimental data can be loaded and displayed as colored heatmaps on the left-hand side of the map, as in [86]. Multiple experimental datasets, for instance for normal and mutated cells, can be loaded and toggled between during analysis. When an experimental protein is selected, Google markers indicate the map location of the protein.

4.2.5 Example 5: Planar DTI Tractography Maps

Description and usage scenario: Given a DTI streamline dataset, three planar schematic representations show projections of important tract bundles onto the three principal projection planes: sagittal, coronal, and transverse. These representations are distributed in Google Maps, with tract-bundle selection capabilities. Bundle statistics, in both textual and image forms, are computed and made accessible in info boxes for tract bundles. Users can easily navigate through a large set of tractograms published as 2D maps and analyze differences in statistics for major structures such as the corpus callosum or cingulum bundle, or find datasets exhibiting desired statistical properties for more detailed analysis in an interactive system.

In Chapter 2 we describe a novel planar visualization of white matter tractograms. Here we introduce a web interface for this type of planar representation by integrating it into Google Maps and enhancing it with labels, statistics, and links (see Figure 4.6). The visualization system described in Chapter 2 was extended to render 2D projections into a set of image tiles instead of the screen. For each cluster, including both tract-bundle and endpoint clusters, the information required for interaction and browsing is exported. Selection information, consisting of evenly spaced points along splines and thickness radii for the splines contained in a cluster, is exported. In line with the tile paradigm, instead of exporting this information to a single large file, it is divided geometrically across the corresponding tiles and written as multiple tile-content text files. Upon user selection, the content file of the clicked tile is fetched from the server and its data analyzed for an intersection. This approach avoids loading and searching through large files. A valid cluster selection is marked on the map with polyline overlays running over the tract splines contained in the selected cluster (see Figure 4.6). For this purpose, spline coordinates for each cluster are exported to files indexed by a unique cluster identifier.

Finally, for each tract cluster a variety of metadata, accessible during map browsing in information boxes as shown in Figure 4.6, is also exported. A short description and links to the most relevant publications or research can be manually added for major tracts. A few 3D poses of each tract bundle are pre-rendered and exported as animated GIF images, indexed by the cluster identifier. Statistical data, in both textual and graphical form, are computed for each cluster and written as HTML content to cluster-indexed files. This information is loaded and displayed in tabbed information boxes at the user's request.

Figure 4.6: DTI tractography data projected onto the sagittal, coronal, and transverse planes. Major tract bundles are represented schematically by their centroid tract; individual tracts in bundles are linked from the centroid bundle to their projected end points. Zooming in allows access to smaller clusters of tracts. Bundles can be selected, and pre-computed statistical data along with 3D views of the tract bundle ("brain view") can be displayed.
Data size and specification: To compensate for their static nature, pre-rendered visualizations should encompass all data associated with a scientific problem. A visualization can thus serve many queries, since data specification can be done during visualization through zooming, panning, and highlighting. Our work exemplifies this approach. In the genome viewer, three different visual queries can be performed depending on zoom level: highlighting regions at an overview zoom, identifying regions with similar expression levels at intermediate zoom, and accessing gene names and expression values at a detailed zoom. Individual visualizations sometimes need to be adapted to suit this approach. Our protein interaction networks use vertex splitting to enable queries by zoom-and-pan and a zoom-linked filter to address clutter. Our co-regulation map uses expression glyphs that guide users towards gene groups with specific expression patterns.

Use: Unlike advanced analysis systems, we have targeted only exploratory, preliminary, and casual browsing of data, or lightweight analysis tasks. As we will show in the following section, fast and intuitive access to visual perspectives of a dataset, even if less flexible than complex systems in terms of interaction and queries, can in some cases help accelerate analysis. It is hard, however, to determine how well suited this approach is in the context of more complex functionality.

Users: Users can be divided into data consumers and data producers. In our experience, the former often perceive a dataset to have a low reward-effort ratio when they are unfamiliar with the type of data, are generally computer averse, or lack access to a computational infrastructure. Browser visualizations targeting such users should be sparse and intuitive. This may seem self-evident, but state-of-the-art visualization systems commonly require scientists to understand visualization-specific jargon (e.g., selecting a specific graph-drawing algorithm).

Data producers want to distribute visualizations along with their raw data so that fellow researchers need not run their own analysis. Data producers will use an interactive system to create the browser visualizations. The assumption is that they are specialists in the data they are distributing, so that a system can use more complex visualization metaphors.

Development overhead: Development overhead can vary greatly among visualizations: our heatmaps are just static images augmented with basic interactivity, co-regulation information had to first be projected into 2D, and the protein interaction networks required an entirely new drawing algorithm. A simple heuristic is that the overhead depends on the effort required to planarize the displayed information (e.g., relational data is harder than projected multidimensional data) and on the amount of data shown.

Deployment: Google Maps visualizations can be designed to work without dependencies on databases and server-side scripting. In such cases they can be deployed by simply copying a directory structure to a web server. This was an important factor in our collaborators' decision to adopt this mode of representation.

4.2.7 Implementing Interaction

While reiterating that complex interactions are not the focus of this approach, we describe below a few interaction patterns common in visualization that are possible in implementations based on Google Maps.

Selection/Brushing: For selection, the positions of selectable elements have to be exported in data files along with the pre-rendered visualization. This information is used to translate the coordinates of mouse events into selections.
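The following is a minimal sketch of how such a selection lookup can work in the tile paradigm, assuming a hypothetical per-tile JSON file of element bounding boxes (our exported files are plain text, and the exact format differs per visualization):

const TILE_SIZE = 256;

// Pixel-space bounding box of a selectable element within one zoom level.
interface Box { id: string; x: number; y: number; w: number; h: number; }

async function hitTest(px: number, py: number, zoom: number): Promise<string | null> {
  // Which tile was clicked, in this zoom level's pixel space.
  const tx = Math.floor(px / TILE_SIZE);
  const ty = Math.floor(py / TILE_SIZE);
  // Fetch only that tile's content file; no large global index is ever loaded.
  const resp = await fetch(`tiles/${zoom}/${tx}_${ty}.json`);
  if (!resp.ok) return null; // background tile: no content file was exported
  const boxes: Box[] = await resp.json();
  const hit = boxes.find(b => px >= b.x && px <= b.x + b.w && py >= b.y && py <= b.y + b.h);
  return hit ? hit.id : null;
}

The same pattern covers the white-matter visualization, with curve-proximity tests replacing bounding-box containment.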
In the co-regulation viewer and heatmap, users select genes by drawing enclosing rectangles. In the white-matter visualization we export curve trajectories for each tract cluster and use the proximity of a mouse click to a curve as a selection heuristic.

Highlighting: Elements selected through interaction or search can be highlighted using markers or polylines (traditionally used to highlight routes in digital geographic maps). Figure 4.2 illustrates a group of selected genes identified by markers. Polylines are used to implement Munzner's Constellation technique [121] of highlighting node and neighbor selections in the protein interaction network (see Figure 4.2) and to highlight tract-cluster trajectories in the white-matter visualizations (see Figure 4.6). Finally, images shown as markers can be customized to create more complex effects. In the genome browser, for instance, multiple co-located markers with alpha gradients create an additive visual effect.

Semantic zooming: Our protein interaction network illustrates semantic zooming by displaying additional proteins with each increase in zoom level. The map framework allows developers to show different images at each zoom level. A scene can thus be pre-rendered at different zoom levels, each with its own visual abstractions. Two important factors to consider are that a visualization can have only as many abstractions as zoom levels and that exported images double in width and height with each additional zoom level. This should be taken into account when designing the number of abstractions, as thirteen-level visualizations are infeasible to distribute (see the following section).

Filtering: Semantic zooming can be used to implement filtering. As mentioned before, our protein interaction network (Figure 4.5) illustrates this concept. While not implemented in any of our visualizations, filtering could also be achieved by rendering multiple complete tile hierarchies for pre-determined filtering conditions. Completely dynamic filtering is infeasible with pre-rendered visualizations.

Data aggregation/abstraction: In our co-regulation viewer we average expression values over groups of genes at varying levels of specificity. In the genome viewer we contemplated displaying aggregated expression values over larger genome regions at overview zooms to deal with gene density, but chose a different approach following user feedback. Semantic zooming is, however, a good way to implement varying degrees of data abstraction. Another way is to use combinations of markers with custom icons to create glyphs that show aggregated data; this has the advantage that such effects can be created programmatically at run time. A simple example is seen in our genome browser, where selection glyphs create an aggregated visual effect.

Details on demand: Figures 4.2, 4.3, 4.5, and 4.6 illustrate how information pop-ups are used to retrieve information about visualization elements. Figure 4.6 shows how pre-computed statistical data and 3D poses can even offer different perspectives on selected data subsets. A second detail-on-demand implementation is shown in Figure 4.3: mouse hovering generates a tooltip overlay. For more interactivity, browser-side graphics can be coupled with Google Maps. The co-regulation map (Figure 4.2) uses Protovis to show expression values of user-selected genes as heatmaps. We note that information used in the detail views (e.g., expression values, 3D poses) must be exported along with the rendered tiles.
Figure 4.7: Linked co-regulation maps of the T-cell (left) and B-cell (right) families. A selection in the T-cell map is reflected onto the B-cell map. A few groups of genes that are co-regulated in both cell families are noticeable by inspecting the upper part of the B-cell map.

Overview+Detail: The implicit overview+detail mechanism in Google Maps is the mini-map. However, more complex interactions can be achieved with browser-side graphics or multiple synchronized Google Maps on the same page. The closest feature to this in our implementations is the dynamically generated heatmaps in the co-regulation viewer. However, it would be easy to extend the protein interaction network with a linked Protovis viewer that displays local network information for selected proteins.

Brushing and Linking: Two of our evaluation subjects noted that linking several of our visualizations together can be beneficial. For example, linking co-expression views (e.g., for different cell families) can answer questions about conservation of gene function over multiple conditions. This functionality was implemented for the co-expression maps using browser cookie-polling, as shown in Figure 4.7.

4.2.8 Improving Performance

Below are a few considerations for improving the performance of tiled visualizations.

None of our visualizations required more than nine zoom levels. Assuming a tile size of 256 pixels, nine levels translate into square images with $2^8 \cdot 256 = 65536$ pixels on a side at the largest zoom level. Furthermore, the number of tiles quadruples with each additional zoom level, so that these visualizations consisted of $\sum_{i=0}^{8} 2^i \cdot 2^i = 87381$ image files. Efficient image compression is desirable to reduce space requirements and speed up tile loading. Tile numbers can also be reduced by exploiting the fact that visualizations often contain areas of empty background. Thus, many tiles can be represented by a single background tile. The coordinates of background tiles are exported at rendering time and subsequently decoded by the JavaScript implementation. Empty tiles are usually compressed into smaller files by default (due to their uniform coloring), and their number is visualization dependent. Still, the performance gains remain meaningful and typically grow considerably as a visualization's zoom levels increase. Table 4.1 summarizes these improvements on several of our visualizations.

            All tiles                       Non-empty tiles
            PNG             JPG             PNG             JPG
Co-reg.     (5461, 37.6)    (5461, 39.9)    (3505, 35.1)    (3505, 30.2)
Heatmap     (5461, 23)      (5461, 29.4)    (2811, 12.7)    (2811, 19)
Networks    (5461, 32.2)    (5461, 33.6)    (4620, 29.8)    (4620, 25.3)
Brain       (5461, 37.6)    (5461, 39.9)    (3505, 34.1)    (3505, 32.2)
Genome      (5461, 35.1)    (5461, 38)      (4051, 27.1)    (4051, 30.3)
Genome*     (87381, 263.4)  (x, x)          (17630, 100)    (x, x)

Table 4.1: Number of tiles and disk space (MB) for the five visualizations, comparing image compression (PNG vs. JPG) and all tiles vs. non-empty tiles; each entry is (tiles, MB). The first five rows correspond to visualizations with 7 zoom levels; the last row corresponds to a 9-level genome browser.
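As an aside, the tile counts above follow from a geometric series; the short derivation below (our notation, not the dissertation's) also shows why the thirteen-level visualizations mentioned in Section 4.2.7 would be infeasible to distribute:

N(L) \;=\; \sum_{i=0}^{L-1} 4^{i} \;=\; \frac{4^{L}-1}{3}, \qquad
N(7) = 5461, \quad N(9) = 87381, \quad N(13) = 22369621.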
As mentioned in the previous section, interaction and data on demand rely on exporting additional information at rendering time that must be fetched and used by the browser visualization. Loading this data all at once, during initialization, can freeze the visualization and result in large memory loads. Instead, in line with the tile approach, the information should be split across multiple files and retrieved only when an interaction demands it. For example, information about the shape of the curves in the white-matter visualization is split over a 10 × 10 grid spanning the visualization. Upon a mouse click, the corresponding cell content is fetched and tested for intersections. If an intersection with a tract cluster is found, a file containing information about this cluster (e.g., cluster trajectories for highlighting, metadata to be displayed in information pop-ups) is retrieved. This ensures that visualizations remain responsive during interactive tasks.

4.3 Evaluation and Findings

An anecdotal evaluation of the map visualizations yielded the following general feedback: in a considerable number of tasks scientists liked lightweight, familiar visualizations; our visualizations were deemed intuitive and easy to use; users pointed out the possibility of collaborative work; one lab coordinator appreciated the ease of disseminating data and has decided to change his database-driven distribution to our map approach; and each specific visualization offers improvements over existing methods.

Four proteomic researchers interested in T-cells and mast cells, from two separate labs, evaluated the protein network map. Four geneticists working with T-cells and NK cells evaluated the co-regulation map, the heatmap, and the genome browser at the end of the development process. Three neuroscientists evaluated the brain projection maps. The Immgen project coordinator was consulted throughout the development process and evaluated the genome browser and the heatmap during implementation. One neuroscientist provided his input on the tractography maps. Below is a summary of the feedback, followed by specific feedback on each of the five visualizations.

4.3.1 Evaluation Summary

All users rated ease of use higher than that of other systems they had worked or experimented with. They were excited to be able to run the visualizations in a browser, and several stated that this makes them more likely to use the visualizations. Most subjects said the available features are enough for quick data analysis. The main workflow for the biological maps was to project genes or proteins of interest onto existing maps. The neuroscience experts found the web interface with the digital-map interaction useful for both quick data inspection and collaboration. Most users were content with the provided feature sets, interaction, and visualization, while some asked for more hyperlinking and metadata features. A majority of our subjects identified the static nature of the maps as a non-issue. The Immgen project coordinator commented on the benefits of being able to accompany raw data with relevant visualizations and on the minimal overhead of both maintaining the map systems and using them. He is actively considering switching the lab's database-driven distribution system to our map approach.

4.3.2 Gene Co-Expression Map

All subjects agreed that the co-expression map is useful. Three subjects would use the maps by projecting their own genes of interest onto one or more cell spaces. Our fourth subject would also look for global patterns of co-regulation, possibly over multiple maps, and suggested we link multiple maps in separate browser tabs. One subject suggested using this application to create customized datasets by selecting subsets of co-regulated genes from explored datasets. All subjects found the interactions intuitive and the maps easy to use.
One of our subjects thought data maps could be useful for researchers new to a lab, since they could start analyzing data right away. She then extended this idea to non-Immgen members and mentioned she would like such visualizations to be present in other data sources as well. She added that her particular lab has good technical support, but that since she is close to graduating and considering doing research on her own, this approach seems very appealing.

Two of our users did not consider the static nature of the maps a drawback. The other two expressed the desire to customize the cell types over which genes are projected. However, they agreed that there are relatively few cell subsets they would choose from and that multiple maps covering these possibilities would probably work. We note that our users were highly familiar with the Immgen data and their analysis was in most cases past the exploratory stage, which explains the desire for increased flexibility.

In terms of features, two users explicitly complimented the superposed expression profiles, stating that they summarize data well and can guide exploration. All users were happy with the heatmap-upon-selection mechanism and with the ability to export selected sets of genes.

4.3.3 Gene Expression Heatmap

Our collaborator asked for browser-based heatmaps in the context of distributing Immgen data in heatmap form to the immunology community. His concern was that the currently deployed static images are limiting for exploring such representations, especially since overview analyses involve matrices of up to 500 × 1000 entries. Our fix of providing gene/cell axis coordinates that follow map zoom and pan was deemed a solution to this problem.

Of our four evaluation subjects, only one used large heatmaps as part of his analysis. He was excited about our distribution mode and said it was a significant improvement over his current analysis. He said the sticky axes made navigation much easier and complimented the ability to select a rectangular region on the map to highlight genes and cell types. Commenting on this type of visualization from the perspective of a user who is highly familiar with his data and would like to build his own maps for analysis and publication, he said that the ability to share visualizations via web links makes collaboration easier.

4.3.4 Genome Map

The genome viewer benefited from iterative development and evaluations between implementation stages. The initial insight was a need for an overview analysis of gene expression in the genome space. Our collaborators' assessment was that bringing forth the correlated regulation of neighboring genes would lead to a better understanding of how gene regulation varies with cellular differentiation. The specific question was to what extent adjacent genes share co-regulated expression patterns.

A first design item validated during development was to not use data aggregation for different zoom levels and to rely instead on additive visual cues of individual items. For instance, while individual gene and cell expression values are not discernible at an intermediate zoom level, the average expression and variability of neighboring genes remain salient. This proved effective: our collaborator noted that this allowed him to identify inactive regions of the genome. For Immgen-specific cells, lymphocytes and other blood-borne cells, these seemed to be primarily the regions carrying long clusters of olfactory receptors.
Another design choice was to introduce regional highlights that let users visualize areas in which several genes meet a chosen pattern of expression. Contrary to his expectations, our collaborator found that such regions proved relatively rare once he could get a whole-genome perspective. He noted that genes with comparable patterns of activity tend to be dispersed, and that co-regulated clusters exist but are relatively rare. He also noted a striking example of neighboring genes with divergent patterns of expression: a few genes interspersed in the middle of the olfactory receptor clusters, which are known to be quite active in lymphoid cells. He concluded that there is likely a higher order of genomic organization that the genome map can help explore.

In our genomic evaluation at the end of development, results were less consistent. We believe this is because the subjects' interests were focused on specific genes rather than on overview analyses of regulation patterns. However, the workflow noted for the heatmap and co-regulation map became immediately apparent in this visualization: users would load their own genes of interest and see how they fit on a specific genome map. Both co-location and expression similarity would be of interest to them. Our subjects were positive about the map approach.

We switched from a standalone system implementation to the web-map approach during development, responding to our collaborators' need to publish the project's data on the web. The ease of use, the visual appearance, and the mode of distribution were noted by several members of the Immgen project at our first release. The static nature of the maps seemed to be a non-issue for our collaborator and most of the subjects in our final evaluation. One of our subjects expressed a desire to build his own maps, but agreed that researchers who are not very familiar with Immgen data would find this a good first contact with the data.

4.3.5 Protein Interaction Networks

Our subjects were excited about looking at large interaction networks in their browsers. The consensus was that the browser setup is highly effective and that they would choose it over other systems they are familiar with. They explained their choice by remarking that they don't like to spend time installing software and learning new features, and they found the techniques we demonstrated intuitive and easy to use. At the time of the demonstration the prototype did not provide sufficient access to metadata; that was the most commonly requested feature.

As for the design decisions underlying the map visualization, relevance-based filtering and vertex duplication, feedback was positive. The unanimous opinion was that relevance filtering was intuitive; subjects found that it corresponded to how they normally approach a new network: identify important or familiar proteins and then drill down to learn more about their neighbors. Another comment was that seeing familiar proteins and connections early helps reinforce their confidence in the visualization. None of our subjects thought that not seeing the whole network at once obstructed their exploration, and one explicitly stated that the simplified view is superior to cluttered network visualizations he had seen before. Three of our subjects were satisfied with how we currently compute protein relevance, while one thought protein connectivity alone was enough, since highly connected proteins correspond to important proteins.
In our first demonstration, hyperlinks to protein copies were placed directly on the map next to protein glyphs. The first researcher we interviewed said he found the concept of split proteins disorienting and would probably not be able to work with it. We then put the hyperlinks in a list on the side of the display, as in the current prototype (Figure 4.5). The researcher thought this visualization was improved because he could go through the copies of a protein systematically to reconstruct its neighborhood. His final response was that splitting proteins is not desirable but is acceptable if it can simplify the visualization. Our other three subjects stated that multiple copies of proteins would not get in the way of their analysis at all; one even said he preferred looking at proteins this way because it made their interaction neighborhoods more apparent. Another subject made the point that pathway drawings often contain multiple copies of proteins. When we argued that those copies are biologically motivated while ours are not, he agreed, but said the technique is still familiar. Generally, it seemed that the primary task they need to perform on protein interaction networks is finding all interactors of a protein. Our subjects thought that the list of protein copies, hyperlinked to the map, allows them to do that without obstructing analysis.

4.3.6 DTI Brain Maps

This visualization was evaluated with a neuroscientist following an informal protocol. The static map implementation of the projection-based DTI visualization was evaluated alongside a stand-alone system implementing the same views, interactively linked to a 3D stream-tube model. Our scientist commented that he would primarily use the stand-alone system because complex tract selections were required that cannot be performed in the static map. However, he pointed out the unique opportunities offered by the map implementation: collaborating with other scientists by sending links, being able to look at datasets anywhere, any time, and browsing through datasets before importing a model into the stand-alone application. He described the projection-based visualizations as intuitive, especially compared to other dimensionality-reduction methods, and as requiring no learning overhead. He also appreciated the 3D poses of tract bundles and the statistics available in the map info-boxes. This evaluation led us to conclude that static maps are less suited to the 3D domain, where complex interactions are needed, but can occupy a task-specific niche such as collaborative work and casual analysis.

4.4 Discussion

In this chapter we advocate for the dissemination of visualizations as large, static data maps with a small set of intuitive interactions. Here we present a few general considerations pertaining to this topic and a list of opportunities that map visualization opens.

4.4.1 General Considerations

As suggested by our evaluation, the low-overhead, tile-based approach we exemplify seems to be particularly attractive to researchers lacking access to a strong computational infrastructure, for unfamiliar datasets, and for casual data browsing. Our evaluation of the white matter visualization shows that in other domains this approach might be more narrowly useful.

From our experience, the Google Maps API can also be a useful medium for gathering feedback on visual encodings, possibly developed as part of another system.
Collaborators are more likely to provide feedback on visualizations that they can access and use with minimal overhead than on ones they must install and learn. Furthermore, concerns such as deployment and platform, rendering speed and interactivity, GUI, and data formats become non-issues.

This work explores only the Google Maps API. However, we hypothesize that other Ajax tiled approaches would also be suitable. More generally, zoom-and-pan frameworks (e.g., the Bing Maps API, Silverlight, OpenZoom) can be used in conjunction with a subset of the design elements discussed in this chapter to develop similar visualizations. Moreover, the development of a tiled framework designed to support data visualization rather than geographical maps could prove useful. Such a framework, if open source, would also alleviate concerns about licensing, support, and stability associated with commercial products. Principles of sparsity and intuitiveness should remain the foundation of tile frameworks, since the proposed browser visualizations should not seek to rival complex systems.

Our examples also demonstrate the synergy between maps and interactive web elements implemented in Protovis. Focus+context visualizations can be created so that maps offer the context while focus views are implemented in Protovis. However, we note that an essential guideline we advocate is simplicity; merely replicating the complexity of stand-alone systems on the web is not our goal.

4.4.2 Opportunities

We end the discussion with a list of opportunities for map-like visualization:

Linking maps: A biological concept is rarely explained by a single perspective on the data, so linking multiple maps together can be beneficial. For instance, linking the genome map to the 2D co-regulation map can be used to test the hypothesis that co-regulation has a genomic-location component. As indicated by one of our subjects, linking multiple co-expression maps (e.g., for different cell families) can answer questions about conservation of gene function over multiple conditions and would be a desirable addition to our framework. An initial prototype of this functionality is shown in Figure 4.7.

Viewing maps on large displays: Most large-display setups have a way to display static images. In this case, zooming would be performed by moving closer to the display. Limitations, however, specifically on semantic zoom, must be imposed on the visualization.

Collaborative work: During our evaluation the users were excited about the opportunities for collaboration offered by maps. Exchanging interactive images rather than static ones, and sending links rather than datasets, was positively received. This concept can easily be extended to support collaborative work; the static nature of the visualizations is in this case an advantage. We would like to add annotation capabilities to our maps to enable researchers to exchange ideas. The static nature of maps is an advantage here too, since it ensures that each user has the same view of the data and that shared comments target the same visualization elements.

Easy instrumentation: An important component of visualization research is understanding how visualizations are used. Owing to the minimal interaction advocated, maps should be easy to instrument. In fact, one of our deployed maps has recently been instrumented using the Google Analytics framework. User interaction capture will be implemented shortly.
4.5 Concluding Remarks

A series of cognitive studies led Hegarty et al. [77, 154] to conclude that "cognitive science research indicates that the most effective visual representations are often sparse and simple. When given control over interactive visualizations, people do not always use these technologies effectively or choose the most effective external representations for the task at hand."

We presented a low-overhead approach that can facilitate browsing for a range of unfamiliar scientific datasets. It relies on pre-computed visualizations, carefully prepared by data experts for distribution with sparse interactions, so that end users can access readily analyzed views of scientific data. We build on the familiarity of the Google Maps framework and leverage its functionality to distribute those views. In an anecdotal evaluation we showed that this data-distribution mode is particularly suited for exploring unfamiliar datasets, for casual data analysis such as at home or while commuting, and for lab biologists who lack access to strong computing infrastructures. Additionally, we lay out design guidelines benefiting those wanting to create such visualizations, and we describe five concrete example visualizations.

Chapter 5

Improving Scientists' Analytic Strategies through User Interface Changes

Sensemaking is a cyclical process in which humans collect information; examine, organize, and categorize that information; isolate dimensions of interest; and use the results to solve problems, make decisions, and take action [32, 133, 135, 140]. Visualizations improve sensemaking by accelerating the search for information, facilitating the discovery of patterns, and providing means for evaluating various hypotheses [32]. Traditionally, analysis was limited to a few datasets and, in the absence of high-performance visualization methods and systems, understanding and exploring one dataset would play a major part in the analysis process. However, data gathering and visualization have evolved to the point where, jointly, they can describe the behavior of systems with intricate structures (e.g., biological systems). Such systems, and even their functional sub-units, are rarely described by just a few pieces of evidence whose discovery visualization may facilitate. In such scenarios, the aggregation of individual pieces of evidence into high-level, experimentally testable hypotheses represents the more significant proportion of the sensemaking process. A research opportunity thus arises: designing methods and interfaces that work in concert with visualization systems and allow researchers to aggregate their findings into cohesive scientific stories.

Catalyzed by growing intelligence needs after 2001, the new field of Visual Analytics (VA) emerged out of traditional visualization efforts to address such problems. Illuminating the Path [165] introduced and defined visual analytics as "the science of analytical reasoning facilitated by interactive visual interfaces". Here, we advance the VA agenda by providing experimental support for the following hypothesis: we can use subtle changes in the interfaces of visual analysis systems to influence users' analytic behavior and thus unobtrusively guide them towards improved analytic strategies. An overview of our results and methodology is shown in Figure 5.1.
Figure 5.1: By making subtle, non-functional changes in the interface of an analysis support module (top), we generated statistically significant changes in users' analytic behavior in a visual problem-solving task. A first set of changes nudged subjects to increase their use of the analysis module by 39% (lower left, p = 0.02) in an attempt to support our subjects' working memory. It also caused them to switch among hypotheses 19% more often (lower center, p = 0.03), indicating more consideration of alternative hypotheses. A second set of changes then led subjects to gather 26% more evidence per hypothesis (lower right, p = 0.01). These three increases compare to smaller or negative variations in a control group (+15%, −17%, −2%).

This work was motivated by cognitive science research showing that human thinking is subject to heuristics and biases that often lead to suboptimal decision making [76]. Simon [151], for instance, shows that humans are subject to "satisficing", a heuristic that limits the search for possible hypotheses to the first one that is good enough. Wason introduced the famous 2-4-6 study [174], which shows that hypothesis confirmation is used instead of the rational strategy of hypothesis disconfirmation. Such effects are not limited to laboratory studies or naive subjects but manifest in scientific research as well [54]. Cognitive science research also shows that such biases can be partially overcome with external or contextual help: Dunbar shows that people's confirmation bias can be overcome, and describes how analogy and unexpected findings often lead to consideration of multiple hypotheses in scientific domains [54, 55].

Specifically, we report results from a controlled study in which subjects were asked to complete three analysis sessions using a system consisting of a visualization and an analysis support module. Two sets of non-functional changes were made to the analysis support interface before the second and third sessions. These changes were designed to improve three hypothesized or observed analytic deficiencies: analysts' excessive reliance on memory, an inability to consider hypotheses in parallel, and an insufficient search for evidence. Our quantitative results show that the interface changes succeeded in alleviating these deficiencies. Compared to a control group, our test subjects used the support module more, switched between hypotheses more often, and collected more evidence per hypothesis. Our data does not merely show that changes in interfaces translate into different user behavior; it demonstrates that we can leverage interface design and cognitive principles in controlled ways to overcome known analytic deficiencies.

Our approach was inspired by two similar paradigms: Thaler and Sunstein's work on libertarian paternalism, which popularizes the notion of "choice architecture design" [164], and Fogg's concept of "persuasive technology" [64]. Both approaches advocate designing a choice architecture or system interface such that users are "nudged" to make decisions in their own and society's best interests. We posit that this approach may facilitate the use of visual analytics expertise to correct biases and heuristics documented in the cognitive science community.
To the best of our knowledge, few concrete attempts have used visual analytics techniques to align descriptive analysis (i.e., what people actually do to derive a solution) with normative analysis (i.e., rational strategies for deriving the best solution), and none have done it using the "nudge" paradigm. Instead, VA research is traditionally aimed at understanding and modeling the sensemaking process [23, 133] and the operations that need to be supported [84, 138], or at creating systems that offer ways of storing, annotating, exploring, and querying evidence sets. Such features support analysis in the same way visualization does, by facilitating access to information, but they do not necessarily structure analysis, a task still left entirely to the analyst's discretion. We note that the aim of this work is not to introduce novel analysis support features or interface design guidelines, but to quantitatively measure the ability of a small set of such elements to nudge users towards normative analysis practices.

5.1 Related Work

In this section we show how our work is motivated by and relates to existing research. We start with an overview of analytic biases and heuristics. We continue with a description of previous work that inspired our approach: libertarian paternalism and persuasive technology. We end by illustrating how our results relate to and advance current visual analytics research. We note that the results presented in this chapter have been published in [93].

5.1.1 Limitations of Human Analysis

Humans are prone to a range of decision-making biases and heuristics which can occasionally produce sub-optimal results [76]. A specific manifestation of such effects occurs in the context of hypothesis-driven analysis. Simon [151] coins the term satisficing, a heuristic that limits analysis to a hypothesis that is good enough. Bruner and Potter [29] show that subjects cling to initial hypotheses and are unable to consider alternative explanations in an experiment involving image slides with varying degrees of focus. More recently, Danner et al. [44] show that three or more retrievals from memory of a specific means towards a goal will inhibit competing means for the same goal. Finally, multiple studies have shown that the use of a single hypothesis biases the way subjects evaluate evidence [111, 35].

Biases are also present in gathering and evaluating evidence pertaining to a hypothesis. According to the scientific method, the best way to test a hypothesis is to attempt to disconfirm it. However, researchers have found that subjects usually try to confirm their hypotheses rather than disconfirm them. That is, subjects will choose experiments that generate results predicted by their hypotheses. This is known as confirmation bias and is eloquently demonstrated in Wason's card test [174].

Many of the studies mentioned above were conducted with naive subjects. As such, the question of the degree to which these observations hold in scientific or clinical reasoning remained open. Several studies show that such biases and heuristics manifest in the scientific and clinical domains as well, albeit perhaps to a lesser degree. Dunbar [54] used a scientific setting, replicating the discovery of a real biological finding, to demonstrate that single-hypothesis and confirmation strategies were predominantly used, and that such strategies prevented the subjects from replicating the scientific discovery. Similarly, Ben-Shakhar et al.
[20] showed that clinicians at a Jerusalem hospital exhibited strong agreement with a priming suggestion when deciding on a diagnosis. Finally, Rodgers and Hunter [139] found that researchers investigating a favored hypothesis selectively deleted studies from a meta-analysis. On the bright side, Klayman argues that even though a confirmation bias exists, under certain circumstances it is a good strategy to use [104]. Furthermore, in an "in vivo" study involving observations of scientists performing everyday analysis, Dunbar [55] found that while subjects do try to confirm hypotheses, their hypotheses will often change in the face of inconsistent findings.

There is significant evidence that analysis can be improved by using prescriptive analysis techniques, training in normative thinking, and external aids that amplify cognition. For instance, it is thought that bounded memory and attention inhibit the analyst's ability to consider multiple hypotheses. The theory of distributed cognition suggests that individuals use their environment as an external aid to amplify their cognition; Clark and Chalmers [39] point to research in support of this theory. In visualization, the authors of [32] state that explicit visual thinking increases an analyst's cognitive span. Dunbar's study [54] demonstrates that if subjects were conditioned to pursue alternative hypotheses and disconfirming evidence, solutions to a scientific puzzle were indeed reached more often. Dunbar [55] also describes how analogy and a string of unexpected findings can often lead to consideration of multiple hypotheses and novel findings. Elstein et al. [61] describe a study revealing that medical students using hypothesis-driven analysis outperformed those who used a data-immersion approach. Scientific studies mentioned in [76] show how multiple-attribute utility theory (MAUT) can reduce biases and heuristics such as the prominence effect, which causes subjects to base a decision on the single attribute they consider most important. Finally, literature rooted in the field of intelligence analysis provides anecdotal evidence of the benefits of applying prescriptive techniques and algorithms to complex analysis tasks [97, 79, 76].

Visualization literature also points out that analysis can be improved by using visual aids. In [50] it is shown that subjects given a problem statement in visual form perform better than subjects given the same information in textual form. Savikhin et al. [144] use a specific example from economic reasoning to show that visualization can help overcome heuristics used in decision making. In [23, 133] the authors call for augmenting the analyst's working memory to increase the attention span for evidence and hypotheses, and for improving divergent thinking by encouraging users to consider alternative hypotheses.

5.1.2 Guiding Users' Choices: Nudges and Persuasive Technology

Thaler and Sunstein's work [164, 159] in the field of behavioral economics popularized the term choice architecture (how a set of choices is presented to a consumer) and the concept of libertarian paternalism: designing choice architectures that "nudge" consumers towards making decisions in their own interest (paternalistic) while leaving choice unrestricted (libertarian). A similar concept was proposed in the HCI domain by Fogg [64], who defines persuasive technology as "interactive information technology designed for changing users' attitudes or behavior".
More recently, Lockton [109] generalized Fogg's persuasive technology and linked it to Thaler and Sunstein's choice architecture model by introducing the "design with intent" concept. Broadly, this refers to design intended to guide user behavior across a range of disciplines, from architecture to software. We build on these previous approaches and demonstrate empirically how the nudge paradigm can further the visual analytics agenda.

Sunstein and Thaler, as well as Fogg, motivate their approaches with two arguments, which they support with experimental evidence. First, any choice architecture or computer interface necessarily influences decision-making behavior, whether intentionally or not. This statement is demonstrated by studies showing how potentially unintentional choice designs, such as state-by-state opt-in versus opt-out organ donation programs, significantly impact people's choices. Second, as shown previously, ample research indicates that people's choices and behaviors are not necessarily aligned with their goals. Ill-formed preferences, default rules, framing effects, and starting points all dominate important decisions and thinking processes. From a visual analytics perspective, if an analyst's objective is to select the optimal course of action based on available data, cognitive biases and heuristics can steer him towards erroneous results.

Both approaches have inspired scientific results that validated their feasibility. Thaler and Benartzi [163] use an array of cognitive effects, such as mental discounting (i.e., weighing current events more than future events) and default options, to persuade the employees of a company to increase their contributions to their retirement plan. In the technological realm, the enhanced speedometer [108] changes its visual appearance based on the current speed limit (when known), encouraging users to stay within speed limits, while the smart sink [13] augments a normal sink with visual cues that make energy consumption apparent. These works provided inspiring design models for the analysis nudges presented in this dissertation.

Finally, Thaler, Sunstein, and Fogg, as well as subsequent research articles, address ethical questions raised by influencing choice or behavior. Thaler's view is that paternalism is unavoidable and that libertarian paternalism should ensure, as a general rule, that people can easily avoid the paternalist's suggested option. Fogg proposes a thorough investigation of the gains and losses of all parties involved in the development, distribution, and use of a particular system to determine its ethicality. Subsequent papers augment Fogg's persuasive technology approach with additional ethical constraints or guidelines. Oinas-Kukkonen [126] defines persuasive systems as "computerized software or information systems designed to reinforce, change or shape attitudes or behaviors or both without using coercion or deception" and proposes that persuasion should always be open and unobtrusive. He also disputes some of Fogg's initial design suggestions, such as surveillance and conditioning, as ethically unacceptable. The nudges used in our work abide by these ethical principles.

5.1.3 Supporting Analysis through Visual Analytics

The work presented in this chapter extends the VA research agenda, which focuses on designing interfaces and visualizations that support the aggregation of data insights into cohesive scientific theories.
Work in the field of VA falls broadly into two categories: theoretical research based on existing psychological studies or user evaluations, and applied work. In the theoretical domain, work in [23, 133] presents a five-stage sensemaking model, derived through Cognitive Task Analysis (CTA) and verbal protocol experiments with analysts, to identify leverage points for visualization. The authors of [84, 138] analyze how users synthesize multiple collections of evidence in a collaborative setting, using a physical, visual medium. Their results, a breakdown of analysis tasks with observed frequency/duration and insights into the workflows of collaborative sensemaking, are useful for deciding which analysis tasks to support. Finally, multiple position papers advocate leveraging the expertise of the cognitive science and intelligence communities in the context of visualization-supported workflows [73, 158]. Our work is tangential to and motivated by such results.

At the opposite end of the spectrum, new applications probe the feature and design space of analysis-support software. Several applications for thought mapping and evidence management use the paradigm of laying out reasoning artifacts on a canvas, either freely or as a tree/graph structure. Examples of such systems are Concept Maps [31], MindManager [115], the Analyst's Notebook [124], Visual Links [83], the Scalable Reasoning System (SRS) [132], and the nSpace Sandbox component [175]. Several systems depart from the canvas paradigm. Entity Workspace [22] operates only on textual evidence and uses grouping and linking as an organizing paradigm in a highly structured medium. In HARVEST [71], users can not only visualize existing information but also construct new analytical knowledge from existing information and apply visualization to it. The authors of [177, 178] apply similar principles to multi-dimensional visualizations and use specific visualization characteristics to drive the organization of evidence. Finally, work in [59] departs from conventional methods by structuring analysis as short stories hyperlinked to evidence, a paradigm based on a narrative theory [63] suggesting that people are storytellers and excel at evaluating a story for consistency, detail, and structure. In our evaluation we use design elements that we distilled from these existing analysis systems.

To the best of our knowledge, few concrete attempts have used visual analytics techniques to bridge the gap between descriptive and normative analysis. As such, our work complements current research by using a visual analytics methodology to create a link between observed analytic deficiencies and corrected behavior. Perhaps closest to the work presented here are results by Savikhin and Maciejewski [144], who demonstrate experimentally that a targeted visual representation can induce normatively correct decisions in an otherwise biased economic choice task. We extend this result by linking it to the more general nudging approach proposed by Sunstein, Thaler, and Fogg, by using interface design in general, and by providing an experimental validation on a high-level analytic task.

5.2 User Study Design

We conducted a controlled user study to test our hypothesis that small changes in a visualization system's interface can be used to produce targeted modifications in users' analytic workflows. This section presents the design of this study.
We start with an overview of the methodology used and continue with an in-depth presentation of each aspect of the study.

5.2.1 Study Overview

Subjects completed an analysis task inspired by a real scientific problem, using a visualization and an analysis support interface (Figure 5.1, top). Each subject performed three such analysis sessions at one-week intervals. Each session lasted roughly one hour. Thirty-six subjects, mostly undergraduate and graduate students, were divided into two groups: 21 test and 15 control subjects. The control group solved all three tasks using the same analysis support interface. Conversely, test-group subjects were given slightly different versions of the analysis support interface in each session. Specifically, two sets of interface nudges were added to the analysis system before the second and third sessions. We hypothesized that, while changes between sessions would be observed in both groups due to task-learning effects, the test group would exhibit additional effects due to the interface nudges.

The analysis task was inspired by the proteomic domain: finding causal paths in protein interaction networks to explain the interdependency of pairs of proteins that are not directly connected. None of the subjects was familiar with the task or the background material beforehand. They were given a 20-minute tutorial at the beginning of the study. Our approach of using a relevant scientific setting and naive subjects was inspired by a study by Dunbar et al. [54].

Our test system was instrumented to automatically log users' interactions. Subjects were also asked to distill their analysis in a written questionnaire at the end of each of the three analysis sessions. We analyzed the datasets both quantitatively, to find support for our nudging hypothesis, and qualitatively, to gain insight into how subjects approached their task.

5.2.2 Task Description

Subjects were asked to solve three artificially constructed analysis tasks inspired by workflows used by proteomic researchers studying protein signaling pathways.

Proteins are functional molecules within cells. They interact with one another, forming complex causal pathways that determine the response of cells to events. Such protein interactions are the object of intense scientific research because understanding cellular pathways would allow researchers to devise efficient drugs that can influence a cell's behavior without causing unwanted side effects. Proteomicists often use visualizations of interaction networks to understand changes in protein activation patterns measured in proteomic experiments. A distinct class of experiments is knockout experiments: here researchers deactivate particular proteins and compare protein activation levels before and after the removal. A more detailed description of protein interaction mechanisms, experimental techniques, and visualization methods can be found in Chapter 3.

Our subjects were given network visualizations that were said to depict protein interactions documented in recent publications. Figure 5.1 shows one of the three distinct networks that subjects were asked to analyze. The networks were manually created and laid out. The familiar Google Maps interface was used to display the network images and offer basic interaction. Clicking on nodes or edges opened information bubbles referring to these particular elements.
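For illustration, the snippet below sketches this information-bubble interaction in TypeScript against the Google Maps JavaScript API (v3 names; the study system may have used an earlier API version). The nodeAt hit-test helper and the per-element evidence files are assumptions made for the sketch, not the study's actual code.

// Assumed to exist elsewhere: the network viewer's map instance and a
// tile-indexed hit test like the one sketched in Section 4.2.7.
declare const map: google.maps.Map;
declare function nodeAt(pos: google.maps.LatLng, zoom: number): Promise<string | null>;

const bubble = new google.maps.InfoWindow();

map.addListener("click", async (e: google.maps.MapMouseEvent) => {
  if (!e.latLng) return;
  const elementId = await nodeAt(e.latLng, map.getZoom() ?? 0);
  if (!elementId) return; // background click: no protein or edge under the cursor
  // Fetch the short fictional paper abstract describing this node or edge.
  const abstract = await (await fetch(`evidence/${elementId}.html`)).text();
  bubble.setContent(abstract);
  bubble.setPosition(e.latLng);
  bubble.open(map);
});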
Interactions were described by short, fictional paper abstracts detailing the particularities of each interaction and the context in which it was discovered. Subjects were told that a knockout experiment had been performed on a specific type of cell. They were informed that a protein had been removed from the cell and that researchers subsequently observed changes (positive or negative) in the levels of several proteins. These changes were marked on the network with arrows. Finally, subjects were asked to use the available information to determine the network paths most likely to have produced those changes, and to rank them by plausibility. This network task represents a visual, complex, and open-ended implementation of the causal reasoning tasks that have been typical choices in cognitive studies [174].

Our networks used proteomic terminology but introduced fictional proteins, interactions, and interaction mechanisms. Thus, the probability of a regulation chain was determined by the logical consistency of the presented evidence. The key rules that subjects were expected to extract from the evidence and use in their analysis were: the probability of a depicted interaction is lower if it was documented in species and cells other than those investigated in the knockout experiment; a correlation between two proteins should be treated as an edge with uncertain directionality; interactions could describe direct or inverse regulation mechanisms; and the sequence of edges in a solution path should correctly explain the sign of the observed change. These assumptions, along with a general description of protein signaling, were illustrated in a 20-minute tutorial (text and video) and were clarified further on request. Moreover, essential terms were highlighted in all evidence text, and in-situ explanations were displayed upon mouse clicks (Figure 5.1).

The order in which the three networks were presented to users was alternated to minimize the chance of network differences influencing the global result. Thus, in the test group, six subjects solved the networks in order 1, 2, 3, seven subjects solved them as 2, 3, 1, and the remaining six solved them as 3, 2, 1. A similar division was used for the control group.

5.2.3 Analysis Interface and Evaluated Nudges

In addition to the protein network viewer, an analysis support interface augmented the experimental environment (Figure 5.1). As noted in Section 5.2.1, control subjects used the same base analysis interface in all three sessions. Test subjects started with an identical interface but were then exposed to upgraded versions in the second and third sessions. These versions, obtained by incrementally adding two sets of evaluated nudges to the base interface, are shown in Figure 5.2.

The base analysis module contained three lists in which users could store their hypotheses, confirming evidence, and disconfirming evidence. Hypotheses were entered into the system as acyclic network paths by clicking on sequences of connected nodes. Evidence was inserted into either the confirming or the disconfirming category by typing free text in a pop-up box. Selecting existing hypotheses would highlight their corresponding paths on the visualization and display their associated evidence, thus allowing subjects to revisit and compare hypotheses. Subjects were familiarized with these features in the tutorial video at the beginning of the study.

Three nudges were designed to alleviate three analytic deficiencies.
First, we assumed that subjects would rely on their working memory rather than use the analysis system. Second, based on cognitive science studies, we assumed that subjects would have trouble considering multiple hypotheses in parallel. Third, we hypothesized that subjects would gather mostly confirming evidence for their hypotheses and ignore disconfirming evidence. Results from our initial session-one runs caused us to adjust this last assumption: subjects were gathering approximately equal amounts of confirming and disconfirming evidence, but in overall small amounts. We refined the design of our last nudge to better target this issue.

As already noted, the first evaluated nudge (Figure 5.2, left) aimed to increase users' reliance on the analysis module. Our design rested on the assumption that if subjects knew other users were actively interacting with the module, they would do so as well. To test this assumption, a section listing online users was added to the base analysis module. As users interacted with the module, this was reflected in a publicly visible status message (e.g., "user is browsing his hypotheses", "user has entered new evidence"). Fake user-bots were added to ensure that a nudge factor was present at all times. This design was inspired by research on conformity effects and motivational factors for online contributions. Specifically, humans change their behavior to match that of others [38, 14, 36], to gain social approval [49], or because they derive utility information from observing what others do [157, 101]. In addition, visibility and status recognition encourage users of social networks to increase their online contributions [10, 125].

Figure 5.2: The two modified analysis interfaces include three evaluated nudges: a box listing online users actively interacting with the analysis module (left), a color gradient (white to gold) showing recently analyzed hypotheses (left), and a redesigned, larger evidence box asking users to commit to the implications of a hypothesis not having associated evidence (right).

The second nudge was designed to encourage users to compare and contrast hypotheses in parallel rather than perform a sequential search in hypothesis space. Initially we planned to evaluate this nudge by itself but ultimately merged it with the first one to keep the length of the study manageable. The design involved assigning each hypothesis a recency score that decayed over time but increased with any interaction targeting the hypothesis (e.g., selection, adding evidence). Recently active hypotheses were highlighted in the hypotheses list using a color gradient based on the recency score. Finally, thresholding the recency score allowed us to determine the number of a user's active hypotheses, display this information in the user status (Figure 5.2, left), and sort users based on how many hypotheses they were investigating. This offered a visual and status reward. While a user could trick the system by quickly switching between hypotheses, this was accounted for during the data analysis stage (see the Results section), and we observed just two intentional instances of it. These first two nudges were integrated into the analysis interface before the second session.
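A minimal sketch of such a recency score is shown below, with hypothetical constants (the dissertation does not specify the decay rate or the threshold):

class HypothesisRecency {
  private score = new Map<string, number>();
  private last = new Map<string, number>();
  constructor(private halfLifeMs = 60_000, private boost = 1.0) {} // assumed values

  // Any interaction with a hypothesis (selection, adding evidence) bumps its score.
  touch(id: string, now: number): void {
    this.score.set(id, this.decayed(id, now) + this.boost);
    this.last.set(id, now);
  }

  // Scores decay exponentially with the time elapsed since the last interaction.
  decayed(id: string, now: number): number {
    const s = this.score.get(id) ?? 0;
    const t = this.last.get(id) ?? now;
    return s * Math.pow(0.5, (now - t) / this.halfLifeMs);
  }

  // Thresholding yields the number of "active" hypotheses shown in the user status;
  // the raw score can likewise drive the white-to-gold color gradient in the list.
  activeCount(ids: string[], now: number, threshold = 0.25): number {
    return ids.filter(id => this.decayed(id, now) > threshold).length;
  }
}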
Finally, the third nudge was deployed before the last session and aimed to encourage test subjects to gather more evidence for the same number of hypotheses. To that end we modified the evidence collection part of the interface (Figure 5.2, right). First, the evidence collection area was made more visually interesting and distinct from the rest of the interface. Second, if no confirming or disconfirming evidence had been entered for a hypothesis, the evidence boxes would read "0 chances that hypothesis is false" or "hypothesis is unlikely". This essentially required subjects to commit to extreme cases, something that humans are known to avoid [52]. Third, an unintentional modification that we introduced while implementing the design was that the evidence boxes in this nudge were larger than in the base interface. We hypothesize that this nudge could be restricted to disconfirming evidence only, in which case it could potentially alleviate confirmation biases [174]. As noted, our subjects did not exhibit a confirmation bias in the early stages of the study, so we resorted to testing the more general case of increasing the total amount of evidence.

5.2.4 User Pool

Our study included a total of 36 subjects. Of these, 16 were women and 20 were men. Six were young professionals, 18 were undergraduates, and 12 were graduate students. In terms of field or major, 26 of the subjects were active or majoring in sciences and engineering, while 10 were humanities students. None of the subjects had previous experience with proteomic analysis. As such, all subjects relied solely on the tutorial provided at the beginning of the study. Subjects were randomly distributed into control (15) and test (21) groups such that the two groups had similar distributions of gender and academic stage (undergraduate, graduate, or postgraduate). Subjects were compensated for their participation.

5.2.5 User Study Limitations

Ease of hypothesis elicitation: A pilot run showed us that free-text specification of hypotheses would have led to considerable variability in what users entered as hypotheses. To be able to compare results across subjects, we limited hypotheses to paths of connected proteins. This interaction mode, reinforced by the tutorial video, gave subjects an easy "recipe" for generating hypotheses: any network path represented a valid hypothesis.

Lack of motivation: Our study did not involve monetary incentives to encourage subjects to provide valid solutions. As a result, several subjects appeared not to devote significant effort to searching for clues beyond those immediately noticeable.

Unforeseen problem-solving strategies: A few of our early subjects copied the network on paper and annotated each interaction and protein. This strategy is not scalable to real protein interaction networks and does not capture the exploratory nature of analysis. To avoid this, we instructed the remaining subjects not to use such exhaustive analysis strategies.

Varying degree of task difficulty: One of the three networks was perceived by several users as less difficult than the other two. We anticipated this problem and alternated the order in which tasks were presented to subjects.

Task misunderstanding: Instead of constructing short paths that linked the knockout protein to each changing protein, two subjects looked for long paths that linked the knocked-out protein and all arrow-proteins (i.e., proteins with changed levels) together. We retained these results because the subjects used this interpretation consistently in all three sessions.

Visualization too limited: Several subjects expressed the need for a more feature-rich visualization, and two of them changed their analysis strategies between sessions to accommodate their requirements.
Specifically, the two subjects realized that a single interaction can belong to multiple hypotheses. As a result, one used pen and paper to do her analysis at the interaction level, while the other added single interactions as hypotheses and then assembled them into higher-level paths. For one of the subjects we discarded the last dataset, while for the other we interpreted the data so as to reconcile the two strategies.

Analysis times varied: We urged users to spend approximately 60 minutes on each session. Several subjects, however, insisted on finishing earlier. Moreover, some datasets showed prolonged intervals of inactivity, and several users were observed to take web-browsing or texting breaks. In our analysis we eliminated intervals with no activity and normalized all measurements by the time spent on the task.

Low number of subjects: Our sample size was relatively low for the open-ended tasks our study involved. However, we note that the trends in the data became apparent with as few as six users in each group and changed very little throughout the experiment.

Effect of change is not captured: Our study does not capture the amount by which interface changes amplify the saliency of our nudges. It may well be that nudges are less observable and less effective if they are introduced in the first release of a system.

5.3 Results

Here we describe quantitative and qualitative results from our user study. All data and analyses are available online [183] and have also been published in [93].

5.3.1 Data Preparation and Analysis

Thirty-two subjects completed all three sessions while four completed only the first two, for a total of 32 ∗ 3 + 4 ∗ 2 = 104 datasets. Four of the subjects, two from each group, solved the tasks on paper using exhaustive annotation of the networks. Three additional users also switched to this approach in the final session. All these data were discarded from the analyses, leaving 104 − 4 ∗ 3 − 3 ∗ 1 = 89 datasets from 13 control subjects and 19 test subjects.

We measured and analyzed three quantitative indicators to support our nudging hypothesis. First, we recorded the number of hypotheses and evidence items entered into the system as a proxy for the degree to which subjects relied on the interface to trace their analysis. This number was normalized by the time, in minutes, subjects spent on each session. Second, we measured the number of times a subject switched between hypotheses and normalized it by the number of hypotheses, as an indicator of the degree to which hypotheses were analyzed in parallel. Third, we recorded the number of evidence items collected and divided it by the number of hypotheses.

In the case of hypothesis switches we ignored selections lasting less than 5 seconds, because we observed that users sometimes cycled rapidly through hypotheses as a method of gauging progress. We also ignored switches occurring in the last part of the analysis, while subjects were filling in the answer questionnaire. We found that by default most users did a comparative analysis of hypotheses at the very end; our nudge, however, was designed to encourage users to constantly consider alternative explanations.

In a second phase we also performed a qualitative analysis of our subjects' workflows. Our goals were to understand the dominant analytic strategies and behavioral patterns, and to verify the degree to which biases and heuristics applied.
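To make these measures concrete, the following sketch computes the three indicators from a session log. The event representation and the function are assumptions made for this illustration; the actual logs and analyses are available online [183].

def session_indicators(events, minutes_active):
    """events: hypothetical list of (time_s, kind, hypothesis_id) tuples,
    with kind in {"add_hypothesis", "add_evidence", "select"}."""
    n_hypotheses = sum(1 for _, kind, _ in events if kind == "add_hypothesis")
    n_evidence = sum(1 for _, kind, _ in events if kind == "add_evidence")

    # Indicator 1: contributions (hypotheses + evidence) per active minute.
    contributions = (n_hypotheses + n_evidence) / minutes_active

    # Indicator 2: switches per hypothesis; selections shorter than 5 s are
    # ignored as rapid cycling used to gauge progress, not true switches.
    selects = [(t, h) for t, kind, h in events if kind == "select"]
    switches = sum(
        1
        for (t0, h0), (t1, h1) in zip(selects, selects[1:])
        if h1 != h0 and t1 - t0 >= 5.0
    )
    switches_per_hyp = switches / max(n_hypotheses, 1)

    # Indicator 3: evidence items per hypothesis.
    evidence_per_hyp = n_evidence / max(n_hypotheses, 1)
    return contributions, switches_per_hyp, evidence_per_hyp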
5.3.2 Quantitative Support for Nudging Hypothesis

The premise of our experiment was that interface nudges would cause test subjects to change their behavior between sessions differently from how control subjects' behavior would evolve naturally as a consequence of learning or boredom. Figures 5.3-5.5 demonstrate the validity of our premise by contrasting the relative changes in performance measures between consecutive sessions in both experimental groups. As expected, change was negligible in control subjects (means of all triangles are close to one) but was significant for test subjects when a nudge was present (means of black squares higher than one). However, test group behavior remained constant whenever performance measures were not specifically targeted (e.g., the change in contributions between the last two sessions). This suggests that subjects were not simply responding to interface changes but to nudges targeting particular performance measures.

Figure 5.3: Changes between the first two sessions (black) caused test subjects (square) to increase the number of hypotheses and evidence items entered into the analysis system by an additional 24% over the control subjects' (triangle) relative increase. The interface changes made before the third session did not have a significant impact on this performance measure (grey).

Figure 5.4: Changes between the first two sessions caused test subjects (square) to increase their switching between hypotheses by an additional 35% over the control subjects' (triangle) relative increase.

Figure 5.5: Changes between the last two sessions (black) caused test subjects (square) to gather 24% more evidence for their hypotheses, as opposed to a constant evidence/hypotheses ratio (-2%) between all consecutive control sessions (triangle). Changes in the test group before the second session (gray) produced non-significant changes in evidence collection as compared to the control group.

Test subjects contributed 39% more hypotheses and evidence items to the analysis module in the second session than in the first. This compares to an increase of only 15% in the control group (Figure 5.3). A t-test found this difference to be statistically significant (t(29) = −2.07, p = 0.02). Contributions remained close to constant between the second and third sessions in both the control and the test group (Figure 5.3). This conforms to the expected behavior, since no nudge targeting contributions was added between these sessions.

The difference in switches between hypotheses was an increase of 18% in test subjects versus a decline of 17% in control subjects (Figure 5.4). The difference was significant, as indicated by a t-test (t(25) = −1.89, p = 0.03). The first two nudges were both added before the second session. Thus, we cannot attribute the observed changes to any single nudge, but only to the combined interface changes made between the first two sessions.

The amount of evidence collected per hypothesis remained fairly constant between sessions in the control group, with a decrease of 2% (Figure 5.5). Test subjects, however, gathered on average 24% more evidence per hypothesis in the third session than in the second. This difference was also found to be statistically significant (t(38) = −2.28, p = 0.01).
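The group comparisons above amount to independent-samples t-tests on per-subject relative changes between consecutive sessions. A minimal sketch follows; the arrays are made-up placeholders (the real values are in the published data [183]).

import numpy as np
from scipy import stats

# Per-subject relative change between sessions one and two (1.0 = no change);
# hypothetical numbers for illustration only.
control_change = np.array([1.10, 0.95, 1.20, 1.05, 1.15, 1.25, 1.08])
test_change = np.array([1.45, 1.30, 1.55, 1.20, 1.40, 1.50, 1.35])

t, p_two_sided = stats.ttest_ind(control_change, test_change)
p_one_sided = p_two_sided / 2  # directional hypothesis: test group increases more
print(f"t = {t:.2f}, one-sided p = {p_one_sided:.3f}")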
5.3.3 Qualitative Analysis of Subjects' Workflows

Our subjects' logs allowed us to qualitatively assess their workflows, extract common strategies, and determine the extent to which subjects relied on analytic biases and heuristics. The following paragraphs summarize our conclusions.

Observed workflows: More than half our subjects started with an initial exploration of the network. This exploration was not hypothesis-driven and typically lasted between three and six minutes. Subjects then moved on to a hypothesis-driven analysis, trying to connect "arrow" proteins to the knockout protein (Figure 5.1). We could discern two strategies for entering hypotheses. Most subjects would pick a candidate path, do a pre-evaluation of its likelihood, enter it into the system provided it was plausible, and then follow with a second pass to summarize and document evidence. These users would often revisit hypotheses and compare them. A few subjects added hypotheses without prior exploration and then summarized evidence in a following pass. Generally, they did not reevaluate those hypotheses until a final pass in which they decided on a global likelihood ordering.

Observed biases and heuristics: We also analyzed the data in terms of biases and heuristics documented in the cognitive science literature. In particular, we looked for confirmation bias, single-attribute analysis (i.e., focusing on a single, most prominent attribute and using it to rank options), conjunction fallacies (i.e., a specific condition is deemed more likely than an encompassing general one), and an inability to operate with varying degrees of probability. Our findings are interesting because they show that some of these effects are not as dominant in a close-to-real analysis setting as cognitive science suggests, and because they describe potential scenarios that trigger such behavior.

A first interesting finding was that confirmation bias was not dominant. In fact, subjects gathered slightly more disconfirming evidence than confirming evidence. Moreover, several users gathered almost exclusively disconfirming evidence, while others pruned paths that had strong negative evidence. Also, one subject would copy entire sections from the information bubbles and enter them directly as confirming evidence, but would always carefully summarize negative evidence. This suggests that she recognized the higher diagnostic value that the disconfirming evidence would have in her final ranking.

A known heuristic that we found in several datasets was single-attribute analysis. We noticed several cases in which subjects added complicated paths before shorter, more intuitive ones. On closer inspection we found that they had selected a single attribute (e.g., cell type) and were using it to include paths in, or discard them from, their analysis.

We also noticed an inability to operate with varying degrees of probability. Several subjects seemed to postpone the consideration of paths involving a complex probability judgment (e.g., multiple interactions with associated uncertainty) and instead concentrated on paths that allowed a binary decision.

Our network setup was well suited to discovering conjunction fallacies, which occur when a specific condition is deemed more likely than a general one. In our network task, short paths should be more likely candidates for analysis than longer paths. In general, our subjects seemed to be aware of this principle; in fact, most new hypotheses abided by this rule. Additionally, several subjects added the short length of a path as positive evidence.
However, we noticed that subjects' analytic strategies tricked them into the conjunction fallacy in a significant number of cases. We observed three main scenarios leading to this.

First, the favored method of expanding one's set of hypotheses was to modify an existing one by rerouting part of its path. At the very least, subjects would use interactions that they were already familiar with. Most subjects avoided picking completely new routes, especially in network areas where they had already done some analysis. Such small changes to initially short paths led subjects to analyze increasingly longer paths. Ultimately, subjects spent considerable time on long paths that were less likely than unexplored shorter options.

Second, subjects occasionally considered longer paths that linked multiple arrow-proteins together more likely than short paths from the knockout protein to each of those arrow-proteins. We hypothesize that users were looking for good unifying stories, a known cognitive tendency. Interestingly, one of the subjects confessed that he was aware of the conjunction fallacy but that the "story was too good" to be irrelevant.

Network layout: The third reason for multiple instances of the conjunction fallacy is tied to the network layouts. The way paths were displayed visually had a significant impact on which ones were chosen for analysis. The majority of subjects preferred paths that described fairly continuous visual arcs, or that were symmetric with ones they had already looked at. Sharp-angled paths were usually selected last, even if they were shorter than already analyzed hypotheses. Another interesting effect observable in several datasets was that symmetrical paths were more often compared to each other than to other hypotheses. The detrimental effect that the network layout had on our subjects' elicitation of new hypotheses is amplified if we consider that those hypotheses will be further expanded, as noted above.

5.4 Discussion

In this section we discuss the broader impact of our contributions, alternative methodologies, and open questions.

5.4.1 Significance

Some of the findings reported in this chapter may seem unsurprising. That interface design can alter analytic workflows is evident, as is the fact that online visibility is correlated with increased online activity [125]. However, our study data do not merely show that changes in interfaces translate into different user behavior. Our contribution lies in demonstrating that interface elements can be leveraged in controlled ways to unobtrusively correct users' strategies: our subjects' deficiency in supporting their hypotheses with evidence was observed in the first session and alleviated by a redesigned analysis support module in the third session. We believe this approach is valuable because it has the potential to correct and improve users' strategies without relying on coercive or obtrusive elements such as pop-up messages or help agents.

5.4.2 Applicability

The analytic biases and heuristics targeted in our study were chosen because they are amply documented in the cognitive science literature. It is likely that one or more of these effects do not manifest, or are even beneficial, in some areas or settings. In fact, naturalistic decision making [105], a distinct research area, models situations (e.g., crisis control, time-sensitive operations) in which heuristics are an efficient analytic strategy.
The aim of this study was not to eliminate a specific set of biases and heuristics but to demonstrate that once such effects are identified we can use interface elements to reduce their occurrence. For example, we posit that in settings typically modeled by naturalistic decision making, heuristics are part of the rational analytic model. In such settings, excessively deliberative and time-consuming analysis could be considered erroneous or suboptimal and discouraged through the use of nudges.

5.4.3 Design Guidelines

Our work was primarily aimed at providing experimental support for applying the nudge paradigm in the visual analytics domain rather than at providing a set of design guidelines. The nudge design space warrants a more exhaustive exploration because it can either provide a tool for guiding users towards better analytic strategies or help us understand how our interfaces unintentionally shape users' exploratory and analytic patterns. Our work exposed us to interesting questions about the ways in which, and the degree to which, tutorials, methods of entering and storing hypotheses, and even simple design choices such as text-area size and color can influence users' behavior.

A few loose design guidelines can nevertheless be distilled from our work. First, placing collaborative elements and conformity triggers in analysis systems can nudge users to change their behavior. We hypothesize that artificial "model-analysts", like the ones we used in our experiment, could nudge users towards conforming to a desired behavior. Second, visual rewards, such as our recency score, can encourage users to consider options in parallel if this is desirable. Third, messages in text areas, perhaps in conjunction with box size, may be used to signal boxes that should not be left empty. Finally, based on the qualitative analysis of our subjects' workflows, we hypothesize that automatically suggesting hypotheses may alleviate some of the observed conjunction fallacies and that subjects would benefit from support for multiple-attribute analysis. Both mechanisms would need to be domain-specific and are beyond the scope of our work.

5.4.4 General Considerations

The data distributions may suggest that nudges, rather than targeting all subjects uniformly, tend to be particularly effective for a subgroup and less so for the rest. As seen in Figures 5.3-5.5, measurements obtained from test subjects appear to form two clusters: one with values similar to those measured in the control group, and one with distinctively higher values. These clusters do not correlate with the order in which networks were presented to users. However, the data gathered as part of this study are insufficient to test this hypothesis.

Our study did not replicate several biases and heuristics documented in the cognitive science literature. Most notably, humans are thought to be unable to elicit many hypotheses and to be biased towards gathering predominantly confirming evidence. In contrast, our subjects generated many hypotheses and showed no confirmation bias. We see two possible explanations for this. First, two of the study limitations may be responsible: the ease of generating hypotheses and the lack of subject motivation led them to pursue multiple hypotheses without developing attachments to favored ones. An alternative explanation is that humans are able to switch from a normal working mode to an analysis mode in which normative principles are more carefully observed. Research by Dunbar [55] hints at this hypothesis.
This latter possibility supports our choice of analysis task. Shorter, more focused tasks like the ones used in many cognitive experiments can be applied to large numbers of users and provide clean data. It is not clear, however, to what extent their results translate to the exploratory analysis typical of scientific discovery. As noted in the related work section, several studies of scientific practice indicate that there are observable differences between laboratory settings and real scientific or clinical situations.

Similarly, our study might have been more informative had we tested domain experts in their field of research rather than naive users on unfamiliar tasks. It remains uncertain whether domain experts, who generally follow well-established workflows, can be nudged as easily as our subjects. Moreover, high familiarity with an analysis system may also cause expert subjects to overlook new interface nudges. Unfortunately, domain experts are scarce and the variability in the scientific problems they solve is high. Thus, quantitative studies that faithfully replicate real-life scientific settings are unlikely to be feasible. Our choice of task and users implements a realistic approximation that provides insight into how to minimize the impact of biases and heuristics in scientific workflows. This endeavor is important because, as described in the beginning of the chapter, domain experts are not immune to cognitive biases and heuristics and often benefit from normative analysis strategies.

5.5 Concluding Remarks

We presented results from a quantitative user study demonstrating that controlled changes in the interface of an analysis system can be employed to correct potential deficiencies in users' analytic behavior. Specifically, we manipulated the design of a basic visual analysis tool over a set of three analysis sessions to produce three changes in our subjects' analysis. First, subjects were nudged to increase their reliance on the analysis support module which accompanied the visualization. Second, subjects were nudged to analyze hypotheses in parallel rather than sequentially. Third, subjects were nudged to gather more evidence for their hypotheses. The results of our user study led us to conclude that once deficient analytic behavior is identified in a scientific workflow supported by visual interfaces, it is likely that those interfaces can be redesigned to correct that behavior.

The significance of our work is three-fold. First, we give an account of how even the simplest design decisions shape users' analytic behavior. Second, we advance visual analytics efforts by introducing and validating an approach that leverages visualization environments to correct analytic biases and heuristics reported in the cognitive science literature. Third, we provide a short overview of the analysis workflows, biases, and heuristics that our subjects exhibited on a scientifically inspired analysis task.

Chapter 6

Discussion and Conclusion

This dissertation exemplifies how visualization supports data-driven scientific discovery, from data representation through exploration and understanding to hypothesis elicitation and testing. It captures the interplay between domain-specific contributions, designed and evaluated through close collaborations with domain experts, and wide-ranging contributions that extend to multiple fields, are inspired by general theories, and are evaluated through rigorous user studies.
This concluding chapter starts by reiterating the concrete contributions of this dissertation and by emphasizing its impact. It continues with a few discussion points pertaining to the dissertation as a whole and several future research directions it directly inspires. It ends with a brief summary of the work.

6.1 Contributions

The contributions of this dissertation are improvements, through novel visualization techniques, in neuroscience, proteomics, and genomics data analysis; a novel visualization distribution mode; and a quantitative study of how interface design can be used to "nudge" scientists towards more efficient and correct analytic practices.

In neuroscience a novel interaction paradigm is introduced: 3D stream-tube models of white matter in the brain are linked to two-dimensional abstractions derived from the same data. In particular, a novel two-dimensional representation of white matter tractography was developed that has the desirable properties of low-dimensional representations while preserving anatomically meaningful coordinates. A concrete visualization system for analyzing white matter tractograms has been built. Accessibility to DTI visualizations for browsing purposes is enhanced by a novel dissemination mode based on the Google Maps API. Data gathered from a formal and an anecdotal evaluation demonstrate the benefits of these approaches.

In proteomics, design guidelines for visualizing protein interaction networks and experimental proteomic data are presented and evaluated with domain experts. Drawing protein networks by scaffolding publicly available protein interactions onto stylized pathway drawings enhances proteomicists' interpretation of the network data. Exploring protein networks at multiple levels of detail using focus+context techniques supports proteomic workflows. Combining publicly available protein interaction networks with new experimental data accelerates the discovery process.

In genomics we show that disseminating data as precomputed visualizations using the Google Maps API can efficiently support many analysis tasks and reduce or eliminate several overheads. Examples of specific visualizations of microarray data and their evaluation with domain specialists are presented. We also show that this method is potentially domain-independent. Further contributions include design elements, challenges, and opportunities encountered when working with precomputed visualizations and the Google Maps API.

Finally, following the visual analytics path, a quantitative user study shows how scientific analysis can be improved, in terms of hypothesis correctness and closeness to normative analysis guidelines, by variations in a system's interface design. We posit that this approach may facilitate the use of visual analytics expertise to correct biases and heuristics documented in the cognitive science community.

6.2 Impact and Generality of this Dissertation

Our three domain-specific contribution areas provide immediate design guidelines for building visualization systems that support the work of scientists in those fields. Moreover, these contributions are visualization and computer science contributions in their own right because they fuel further innovation within computer science and because they are, at least to some extent, generalizable to other application areas. Finally, our nudging contribution is independent of a particular domain and even of visualization itself.
It can therefore impact any computer-driven analysis and thus contribute to better research in a wide range of domains.

We are confident that the contributions presented in Chapters 2-4 can directly impact neuroscientists and genomic and proteomic researchers, helping them understand the inner workings of the human body and mind and devise better drugs and treatments that will improve our quality of life. Our confidence is rooted in the fact that these contributions are a product of collaborative, interdisciplinary design. They are motivated by analysis shortcomings in neuroscience, proteomics, and genomics identified by domain experts in those fields. Their design and evaluation were performed with the assistance of those ultimately benefiting from these methods.

Additionally, these contributions can inspire other visualization researchers to build upon our results. Such work already exists [128, 146, 106], but we believe our results have the potential to further impact the design of visualization methods for domain-driven network exploration, neural circuitry analysis, and genomic data browsing.

Also, such domain-driven visualization contributions directly benefit the domains that motivated them but often generalize to other application areas as well. Existing work that builds on our approaches to create visualization solutions in domains such as archeology [110] or seismology [62] vouches for that. Our application of the Google Maps paradigm to neuroscience, though originally motivated by genomics, is another example. Interacting with white matter tractograms via simplified, two-dimensional proxies can be applied to any streamline visualization, such as those created for fluid-flow datasets. Finally, many of the benefits of collating new experimental data with known network information projected into meaningful spaces, as described in Chapter 3, are likely to be transferable to neural circuitry analysis (see Section 6.4.2).

Moreover, it is important to note that the generality of contributions is often tied to their specificity. In the previous paragraph we gave a few examples of specific paradigms that are almost directly transferable to other domains. High-level contributions, however, are likely to be applicable to a wider range of application areas, albeit with additional refactoring work. Examples of such contributions from this dissertation are: interacting with complex data types through alternate, simpler proxies that abstract certain aspects of the data; using linked views as a path towards understanding complex datasets; tightly coupling visualizations of many relevant data sources (e.g., new experimental data, known information, publications) into a unitary system; and distributing raw data along with readily analyzable views of it. While such contributions do not provide a step-by-step recipe for building a system in a new domain, they do provide overarching design principles that can guide visualization developers.

Finally, the "nudging" paradigm, applied to the visual analytics domain, represents a novel and viable solution for bridging the gap between descriptive and normative analysis using visual analytics methodologies. The need to support human analysis against analytic biases and heuristics has been recognized as one of the major objectives of visual analytics [73, 158]. The results presented here support the hypothesis that interface changes can be leveraged to guide users towards normative analytic behavior.
Beyond its purely scientific significance, this finding can help catalyze research into supporting analytic deliberation through the use of interfaces and visualization. Specifically, it can provide a foundation for future visual analytics research to determine biases and heuristics that would benefit from "nudging", perhaps across multiple domain areas, to design efficient interface nudges, and to evaluate their effects. At a high level, this contribution is independent of a particular domain and even of visualization itself, and can therefore impact any computer-driven analysis and contribute to better research in a wide range of domains.

6.3 Discussion Items

This section presents a few discussion points pertaining to the dissertation as a whole. More detailed items for each of the four specific topics presented in this dissertation can be found at the end of Chapters 2-5.

Visualization can be thought of as a box of tools and building materials. Visualization researchers use these tools to build analysis systems that help researchers understand and hypothesize about their data faster, with less effort, or more correctly. Many visualization systems, although assembled using the same tools and materials, are unique, novel, and useful through their design and the problems they solve. The work we presented exemplifies how general building blocks are shared across multiple contribution areas: brushing and linking in multiple views helps both neuroscientists and proteomicists, low-dimensional representations are applicable to both genomic data and 3D neurological datasets, and the use of a digital map framework can be useful in the biology realm as well as in neuroscience.

While each chapter can be thought of as an independent unit, covering its own research agenda, methodology, and contributions, the dissertation is unified by its attempt to use visualization to help scientists perform their analysis more accurately and more efficiently. It seizes the opportunity to demonstrate the benefits of visualization innovation in three concrete areas, but at the same time proposes a quantitative approach to improving scientific analysis workflows that is independent of any particular visualization and domain.

The contributions of the presented work cover a continuum from specificity to generality. Chapters 2 and 3 target specific analysis workflows and tasks: white matter tractography and tract-bundle selection, and analysis of proteomic experimental data in the context of available protein interaction information. Chapter 4 shows that representing data as precomputed digital maps is desirable in a range of analysis tasks and, while inspired by genomics, is not limited to a particular scientific field. Finally, the quantitative study on analysis "nudging" is domain-independent and demonstrates how interface elements can be used to guide users towards better analysis regardless of the visualizations they use.

The dissertation combines qualitative and quantitative evaluation to measure the effectiveness of the novel techniques and design principles it introduces. Quantitative evaluations, as part of controlled user studies, are generally desirable as a means of evaluation because they provide numerical evidence of a new method's efficiency and quantify the performance gain at the same time. However, many of the methods presented in this dissertation address domain experts and complex, targeted scientific tasks.
In such cases it is often hard to find enough users to achieve statistical significance and to design tasks that are simple enough to be measured and reproduced yet complex enough to represent a meaningful unit of scientific analysis. Anecdotal evaluations with domain experts are convenient for iterative development and tight collaborative settings and will generally offer a good approximation of user preferences and performance. For example, the finding that proteomic researchers are deterred from analysis by unstructured network representations was determined through anecdotal evaluation and has been independently verified by two concurrent studies. Similarly, our finding that planar abstractions attached to 3D white matter tractograms accelerated tract selection was also confirmed and quantified by concurrent research.

A unifying principle throughout this document is Brooks' "computer scientists as toolsmiths" paradigm. Each of the contributions presented in this dissertation is a direct result of collaborative work with domain experts from neuroscience, proteomics, genomics, and cognitive science. Using researchers' real problems to drive visualization innovation, while often laborious for both domain experts and computer science researchers, will ultimately identify where computers can help most and benefit all parties.

6.4 Open Research Opportunities

In line with the collaborative nature of visualization, future research opportunities in the specific domains presented here should be identified by continuing the collaborative discovery processes with researchers in neuroscience, proteomics, and genomics. A few immediately foreseeable opportunities, however, are inspired by the interplay between the different components of this dissertation and by feedback and observations gathered as part of our work.

6.4.1 Data Infrastructure for Distributing, Analyzing and Cross-Referencing Neurological Data

A tighter integration of neurological datasets, along with appropriate visualization and querying capabilities, into both clinical and research neuroscientific communities is desirable yet missing. Genomic and proteomic data, results, and publications are distributed across a wide range of websites, databases, and systems that provide querying, visualization, processing, and analysis algorithms for biological data. In general, such data sources are tightly cross-referenced with each other. This distributed knowledge system allows small research communities to build on top of data infrastructures created and maintained by larger research laboratories and institutions. It also allows researchers to easily relate their new data and hypotheses to existing and evolving knowledge.

Such a data infrastructure is lacking in neuroscience. Datasets are sparsely available, can usually only be accessed as raw data files, and cross-referencing is nearly nonexistent. We believe part of this landscape is a consequence of the inherently 3D data and visualizations, which do not translate easily to text-based web and querying paradigms. Our mapping approach is a first step in the direction of web-accessible neurological datasets. Additional work is necessary, however, to create web-deployed tools and visualizations that are more robust, dynamic, and suited to querying and analyzing neurological data. We believe visualization technologies fueled by this application area could then be applied to a plethora of domains relying on the analysis of three-dimensional data.
6.4.2 Creating Tools for Analyzing Neurological Networks

Our results on protein interaction pathways inspire work on network analysis tools for neurological data. The ensemble of neurons in the human brain essentially forms networks that connect different regions of the brain at different scales and complexities. Just like protein pathways, such neural networks are associated with different cognitive functions, all worth studying and understanding. Advances in data acquisition techniques (DTI, fluorescent microscopy) provide new and efficient ways of mapping neural connectivity in the brain, at scales that span both inter- and intra-brain regions. Efforts to document such connections in open databases have started to emerge. This creates opportunities similar to the ones already tackled in proteomics: developing novel neural network visualizations that collate data from multiple sources, such as connectivity databases, experimental data, and metadata (e.g., publications, anatomical atlases).

The particularities of the neuroscience domain render this problem interesting and challenging at the same time. First, unlike protein interaction networks, neurological circuitry resides in an anatomical space that is highly relevant for neuroscientists. Moreover, this space is three-dimensional and thus poses representational and interaction challenges. Second, neurological networks can be analyzed at different scales: connections that link major brain regions, connections within those regions, or individual neurons. Third, anatomical particularities of each individual brain, especially in diseased cases, influence the integrity and strength of brain connectivity.

6.4.3 Automatically Suggesting Viable Hypotheses in Protein Pathway Analysis

The qualitative analysis in our "nudging" study (Chapter 5) revealed that our subjects exhibited cognitive biases while exploring our protein networks and identifying potential hypotheses. Moreover, protein interaction networks are dense, complex, and becoming increasingly so. Thus, simply immersing scientists in network visualizations, even under the "guidance" of an experimental dataset, may not be the most effective path to new discoveries.

An alternative is using bioinformatic algorithms to automatically suggest sets of likely hypotheses to researchers. We posit that even the simplest of methods, shortest-path computation for example, could accelerate the analysis process (see the sketch below). More advanced analytic methods, however, such as Petri net or Bayesian modeling, are likely to complement human analytic abilities and to constrain the analysis space to only the most viable hypotheses. In general, we believe that a tight integration between human abilities, be they cognitive or perceptual, and computer-specific strengths is the likely path to optimal analysis.
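As an illustration of the simplest such method, the following sketch enumerates one candidate hypothesis per observed change as a shortest path in the interaction graph; the graph encoding and function names are assumptions made for this example, not part of an existing system.

from collections import deque

def shortest_path(graph, src, dst):
    """Breadth-first search over an undirected interaction graph given as
    {protein: set(neighbors)}; returns one shortest path or None."""
    prev, frontier, seen = {}, deque([src]), {src}
    while frontier:
        node = frontier.popleft()
        if node == dst:
            path = [node]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        for nxt in graph[node] - seen:
            seen.add(nxt)
            prev[nxt] = node
            frontier.append(nxt)
    return None

def suggest_hypotheses(graph, knockout, changed_proteins):
    """One candidate hypothesis per observed change, shortest first."""
    paths = (shortest_path(graph, knockout, p) for p in changed_proteins)
    return sorted((p for p in paths if p), key=len)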
6.4.4 Cognitive and Domain Driven Analysis Tools and Visualizations

An immediate extension required before the "nudging" paradigm presented in Chapter 5 can gain practical applicability is quantifying the nudging potential of a set of common interface design elements and constructions. We imagine one or more studies that would use short decision or deliberation tasks to quantify the nudging abilities of interface characteristics and, in general, of elements that are part of the distribution and presentation of an analysis tool. Examples of tested items could include widget type, placement, color, size and styling, alert and help messages, or tutorials. A careful design would allow us to crowd-source such user studies [103], thus making them manageable, extensible, and sufficiently representative. Such studies could yield readily applicable design guidelines for practitioners to use in the development of new user interfaces.

A second extension required for "nudges" to be used in the ways envisioned in this dissertation is determining desired and undesired behavior. While in our current work we considered general biases and heuristics, as described by the cognitive science community, we believe that domain-specific particularities can determine what constitutes optimal or deficient decision strategies. This view belongs to the more general goal of understanding how to adjust analysis tools and visualizations to match the cognitive particularities of specific domains, domain categories, and analysis situations.

To exemplify, naturalistic decision making (NDM) [105] represents a branch of cognitive science that models cognitive processes occurring in time-sensitive situations and suggests effective decision-making strategies in such environments (e.g., disaster response, emergency medicine, military operations). While a few visual analytics efforts have already prototyped disaster response applications, the main supported scenario involved collaborative and distributed decision making in which field agents equipped with small mobile displays communicate with an operational headquarters. As such, contributions were generally limited to extreme-resolution visualization (e.g., small displays for field operatives, large ones for headquarters) and to supporting distributed and collaborative analysis (e.g., co-located around a tabletop display, or distributed between operatives).

A different yet interesting approach is to tailor analysis methods to the cognitive particularities, constraints, and limitations of individuals placed in NDM situations. For example, the nudging paradigm as described in this dissertation would have to be adapted for an NDM setting: nudging may need to encourage fast, heuristic decision making rather than highly deliberative analysis. Moreover, certain interface design patterns, interaction techniques, or analytic workflows may overload NDM users. A wide range of cognitive principles applicable to time-sensitive decision making should thus be matched to appropriate analytic tools.

Such thinking could be applied to visualization as well. Visual attributes and cues, aesthetic and design principles, and even entire visualization methods that are effective in ordinary analysis settings might be unsuited for decision making that is quick, heuristic, and driven by limited cognitive bandwidth. A potential hypothesis is that sparsity, aggregation, emphasis on strong visual cues, and a limitation of subtle cues, data attributes, and dimensions might encourage fast decision making and benefit NDM settings, even if the data are represented less faithfully. Another hypothesis is that some visualization methods require more cognitive bandwidth than others, making them less suited for quick analysis. A few concrete questions that could form the object of such research spring to mind. Are heatmap representations more suited to quick decision making than parallel coordinates? Is plotting a small number of the most representative data points less visually inhibiting than showing the entire dataset? Is a heatmap using a discrete and limited color range easier to interpret than one using a wide and continuous color palette?
While binary answers to some of these questions may seem trivial, quantifying differences in perceived complexity, and whether they translate into an inhibition to make a decision, would allow visual designers to create systems that fit the constraints of specific application areas.

In line with the visualization-researcher-as-toolsmith paradigm, we envision developing and evaluating such visualization and analysis principles by coupling rigorous experimentation with strong collaborations in other domains and the development of concrete analysis applications scaffolded on our findings. Ultimately, this work can lead to analysis systems that are better at supporting and complementing human perceptual and cognitive abilities in specific analytic contexts.

6.5 Summary

The work presented in this dissertation enables researchers in three specific scientific areas, neuroscience, proteomics, and genomics, to do better analysis in less time. It also lays the foundation for the development and quantitative evaluation of user interface elements that can unobtrusively guide scientists towards more efficient and correct analysis workflows, regardless of their fields of research. The methods introduced have been developed following the traditional visualization methodology, which aims to improve the way data is visualized and interacted with, as well as the visual analytics methodology, which aims to improve the analysis process itself.

Chapter 2 shows how traditional 3D white matter brain models can be linked to planar representations, some of them novel, to accelerate typical interaction tasks and improve data understanding in neuroscience. Chapter 3 shows how publicly available protein interaction information and proteomic experimental data can be combined visually to answer scientific questions in ways that harness researchers' intuition and support their workflows. Chapter 4 introduces a novel way of disseminating genomic data and shows how it can facilitate or accelerate data browsing, lightweight analysis, and data sharing. Chapter 5 embodies the visual analytics approach by combining the "nudge" paradigm and elements of persuasive technology with a quantitative user study to show how individual user interface design elements can guide users towards more correct analysis behavior.

Bibliography

[1] Circos. http://mkweb.bcgsc.ca/circos/.

[2] Cutting-edge tools for expression analysis. www.silicongenetics.com.

[3] Decision site for functional genomics. http://www.Spotfire.com.

[4] Immgen project. Website. http://www.immgen.org/.

[5] Ingenuity. http://www.ingenuity.com/.

[6] A.T. Adai, S.V. Date, S. Wieland, and E.M. Marcotte. LGL: creating a map of protein function with an algorithm for visualizing very large biological networks. Journal of Molecular Biology, 340(1):179–190, 2004.

[7] D. Akers, A. Sherbondy, R. Mackenzie, R. Dougherty, and B. Wandell. Exploration of the brain's white matter pathways with dynamic queries. In Proc. of Visualization, pages 377–384, 2004.

[8] David Akers. Wizard of Oz for participatory design: Inventing an interface for 3D selection of neural pathway estimates. In Proceedings of CHI 2006 Extended Abstracts, pages 454–459, 2006.

[9] H. Alt and M. Godau. Computing the Fréchet distance between two polygonal curves. International Journal of Computational Geometry and Applications, 5(1):75–91, 1995.

[10] M. Ames and M. Naaman. Why we tag: motivations for annotation in mobile and online media.
In Proceedings of the SIGCHI conference on Human factors in computing systems, page 980. ACM, 2007.

[11] K. Arakawa, S. Tamaki, N. Kono, N. Kido, K. Ikegami, R. Ogawa, and M. Tomita. Genome Projector: zoomable genome map with multiple views. BMC Bioinformatics, 10(1):31, 2009.

[12] G. Aravindhan, G.R. Kumar, R.S. Kumar, and K. Subha. AJAX Interface: A Breakthrough in Bioinformatics Web Applications.

[13] E. Arroyo, L. Bonanni, and T. Selker. Waterbot: exploring feedback and persuasive techniques at the sink. In Proceedings of the SIGCHI conference on Human factors in computing systems, page 639. ACM, 2005.

[14] S.E. Asch. Studies of independence and conformity: A minority of one against a unanimous majority. Psychological Monographs, 70(9):1–70, 1956.

[15] K. Backhaus, B. Erichson, W. Plinke, and R. Weiber. Multivariate Analysemethoden: Eine anwendungsorientierte Einführung. Springer, 2005.

[16] M.Q.W. Baldonado, A. Woodruff, and A. Kuchinsky. Guidelines for using multiple views in information visualization. In Proceedings of the working conference on Advanced visual interfaces, pages 110–119. ACM New York, NY, USA, 2000.

[17] A. Barsky, T. Munzner, J. Gardy, and R. Kincaid. Cerebral: visualizing multiple experimental conditions on a graph with biological context. IEEE Transactions on Visualization and Computer Graphics, 14(6):1253–1260, 2008.

[18] Peter J. Basser, James Mattiello, and Denis LeBihan. Estimation of the effective self-diffusion tensor from the NMR spin echo. J Magn Reson B, 103(3):247–254, March 1994.

[19] P.J. Basser, S. Pajevic, C. Pierpaoli, J. Duda, and A. Aldroubi. In vivo fiber tractography using DT-MRI data. Magnetic Resonance in Medicine, 44(4):625–632, 2000.

[20] G. Ben-Shakhar, M. Bar-Hillel, Y. Bilu, and G. Shefler. Seek and ye shall find: Test results are what you hypothesize they are. Journal of Behavioral Decision Making, 11(4):235–249, 1998.

[21] S.I. Berger, R. Iyengar, and A. Ma'ayan. AVIS: AJAX viewer of interactive signaling networks. Bioinformatics, 23(20):2803, 2007.

[22] E. Bier, E. Ishak, and E. Chi. Entity Workspace: an evidence file that aids memory, inference, and reading. Intelligence and Security Informatics, pages 466–472, 2006.

[23] J.W. Bodnar. Making sense of massive data by hypothesis testing. In International Conference on Intelligence Analysis, pages 2–4.

[24] M. Bostock and J. Heer. Protovis: A Graphical Toolkit for Visualization. IEEE Transactions on Visualization and Computer Graphics, 15(6):1121–1128, 2009.

[25] D. Botstein and K.W. Kohn. Molecular interaction map of the mammalian cell cycle control and DNA repair systems. Molecular Biology of the Cell, 10(8):2703–2734, 1999.

[26] N. Boukhelifa, J.C. Roberts, and P.J. Rodgers. A coordination model for exploratory multi-view visualization. In Coordinated and Multiple Views in Exploratory Visualization, 2003. Proceedings. International Conference on, pages 76–85, 2003.

[27] F.P. Brooks Jr. The computer scientist as toolsmith II. Communications of the ACM, 39(3):61–68, 1996.

[28] A. Brun, H.J. Park, H. Knutsson, and C.F. Westin. Coloring of DT-MRI fiber traces using Laplacian eigenmaps. Lecture Notes in Computer Science, pages 518–529, 2003.

[29] J.S. Bruner and M.C. Potter. Interference in visual recognition. Science, 144(3617):424–425, 1964.

[30] A. Buja, J.A. McDonald, J. Michalak, W. Stuetzle, and M. Bellcore. Interactive data visualization using focusing and linking. In IEEE Conference on Visualization, 1991. Visualization'91, Proceedings., pages 156–163, 1991.
[31] A.J. Cañas, R. Carff, G. Hill, M. Carvalho, M. Arguedas, T.C. Eskridge, J. Lott, and R. Carvajal. Concept maps: Integrating knowledge and information visualization. Lecture Notes in Computer Science, 3426:205, 2005.

[32] S.K. Card, J.D. Mackinlay, and B. Shneiderman. Readings in information visualization: using vision to think. Morgan Kaufmann, 1999.

[33] M. Catani, R.J. Howard, S. Pajevic, and D.K. Jones. Virtual in vivo interactive dissection of white matter fasciculi in the human brain. Neuroimage, 17(1):77–94, 2002.

[34] M. Chalmers. A linear iteration time layout algorithm for visualising high-dimensional data. In Proceedings of the 7th conference on Visualization'96. IEEE Computer Society Press Los Alamitos, CA, USA, 1996.

[35] L.J. Chapman and J.P. Chapman. Illusory correlation as an obstacle to the use of valid psychodiagnostic signs. Journal of Abnormal Psychology, 74(3):271–280, 1969.

[36] T.L. Chartrand and J.A. Bargh. The chameleon effect: The perception-behavior link and social interaction. Journal of Personality and Social Psychology, 76(6):893–910, 1999.

[37] Wei Chen, Ziang Ding, Song Zhang, Anna MacKay-Brandt, Stephen Correia, Huamin Qu, John Allen Crow, David F. Tate, Zhicheng Yan, and Qunsheng Peng. A novel interface for interactive exploration of DTI fibers. IEEE TVCG (Proc. of Visualization), 2009.

[38] R.B. Cialdini and N.J. Goldstein. Social influence: Compliance and conformity. 2004.

[39] A. Clark and D. Chalmers. The extended mind. Analysis, 58(1):7, 1998.

[40] Andy Cockburn and Bruce McKenzie. Evaluating the effectiveness of spatial memory in 2D and 3D physical and virtual environments. In CHI'02, pages 203–210, 2002.

[41] I. Corouge, S. Gouttard, and G. Gerig. Towards a shape model of white matter fiber bundles using diffusion tensor MRI. In IEEE International Symposium on Biomedical Imaging: Nano to Macro, 2004, pages 344–347, 2004.

[42] W. Chen, Z. Ding, S. Zhang, A. MacKay-Brandt, S. Correia, H. Qu, J.A. Crow, D.F. Tate, Z. Yan, and Q. Peng. A novel interface for interactive exploration of DTI fibers. IEEE Transactions on Visualization and Computer Graphics, 15(6), 2009.

[43] C.M. Danis, F.B. Viegas, and M. Wattenberg. Your place or mine?: visualization as a community component. In Proceedings of CHI, 2008.

[44] U.N. Danner, H. Aarts, and N.K. de Vries. Habit formation and multiple means to goal attainment: Repeated retrieval of target means causes inhibited access to competitors. Personality and Social Psychology Bulletin, 33(10):1367, 2007.

[45] A.C.E. Darling, B. Mau, F.R. Blattner, and N.T. Perna. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Research, 14(7):1394, 2004.

[46] R. Davidson and D. Harel. Drawing graphs nicely using simulated annealing. ACM Transactions on Graphics (TOG), 15(4):301–331, 1996.

[47] E. Demir, O. Babur, U. Dogrusoz, A. Gursoy, G. Nisanci, R. Cetin-Atalay, and M. Ozturk. PATIKA: an integrated visual environment for collaborative construction and analysis of cellular pathways. Bioinformatics, 18(7):996, 2002.

[48] Cagatay Demiralp and David H. Laidlaw. Similarity coloring of DTI fiber tracts. In Proceedings of DMFC Workshop at MICCAI, 2009.

[49] M. Deutsch and H.B. Gerard. A study of normative and informational social influences upon individual judgment. Journal of Abnormal and Social Psychology, 51(3):629–636, 1955.

[50] G.W. Dickson, G. DeSanctis, and D.J. McBride. Understanding the effectiveness of computer graphics for decision support: a cumulative experimental approach. Communications of the ACM, 29(1):47, 1986.
[51] Z. Ding, J.C. Gore, and A.W. Anderson. Classification and quantification of neuronal fiber pathways using diffusion tensor MRI. Magnetic Resonance in Medicine, 49(4):716–721, 2003.

[52] W.M. DuCharme. Response bias explanation of conservative human inference. Journal of Experimental Psychology, 85(1):66–74, 1970.

[53] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Wiley-Interscience Publication, 2nd edition, 2000.

[54] K. Dunbar. Concept discovery in a scientific domain. Cognitive Science, 17(3):397–434, 1993.

[55] K. Dunbar. What scientific thinking reveals about the nature of cognition. Designing for science: Implications from everyday, classroom, and professional settings, pages 115–140, 2001.

[56] T. Dwyer, Y. Koren, and K. Marriott. IPSep-CoLa: An incremental procedure for separation constraint layout of graphs. IEEE Transactions on Visualization and Computer Graphics, 12(5):821–828, 2006.

[57] P. Eades. A heuristic for graph drawing. Congressus Numerantium, 42(149160):194–202, 1984.

[58] P. Eades and C.F.X. De Mendonca. Vertex splitting and tension-free layout. Lecture Notes in Computer Science, pages 202–211, 1995.

[59] R. Eccles, T. Kapler, R. Harper, and W. Wright. Stories in GeoTime. Information Visualization, 7(1):3–17, 2008.

[60] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863, 1998.

[61] A.S. Elstein and A. Schwarz. Clinical problem solving and diagnostic decision making: selective review of the cognitive literature. Stroke, 33:493–6, 2002.

[62] C. Engelsma and D. Hale. Visualization of 3D tensor fields derived from seismic images.

[63] W.R. Fisher. Narration as a human communication paradigm. Contemporary rhetorical theory: A reader, page 265, 1999.

[64] B.J. Fogg. Persuasive computers: perspectives and research directions. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 225–232. ACM Press/Addison-Wesley Publishing Co., 1998.

[65] A. Frick, A. Ludwig, and H. Mehldau. A fast adaptive layout algorithm for undirected graphs. In Proceedings of the DIMACS International Workshop on Graph Drawing, pages 388–403. Springer-Verlag London, UK, 1994.

[66] Y. Frishman and A. Tal. Online dynamic graph drawing. IEEE Transactions on Visualization and Computer Graphics, pages 727–740, 2008.

[67] T.M.J. Fruchterman and E.M. Reingold. Graph drawing by force-directed placement. Software: Practice and Experience, 21(11):1129–1164, 1991.

[68] T.M.J. Fruchterman and E.M. Reingold. Graph drawing by force-directed placement. Software: Practice and Experience, 21(11):1129–1164, 1991.

[69] G.W. Furnas. Generalized fisheye views. ACM SIGCHI Bulletin, 17(4):23, 1986.

[70] Steven Gomez, Radu Jianu, and David H. Laidlaw. A fiducial-based tangible user interface for white matter tractography. In Proceedings of ISVC 2010, 2010.

[71] D. Gotz, M.X. Zhou, and V. Aggarwal. Interactive visual synthesis of analytic knowledge. In Proceedings of the IEEE Symposium on Visual Analytics Science & Technology, pages 51–58, 2006.

[72] Henry Gray. Anatomy of the Human Body. Lea & Febiger, 1918.

[73] T.M. Green, W. Ribarsky, and B. Fisher. Building and applying a human cognition model for visual analytics. Information Visualization, 8(1):1–13, 2009.
[74] D.L. Gresh, B.E. Rogowitz, R.L. Winslow, D.F. Scollan, and C.K. Yung. WEAVE: A system for visually linking 3-D and statistical visualizations, applied to cardiac simulation and measurement data. In Proceedings of the Conference on Visualization '00, pages 489–492. IEEE Computer Society Press, Los Alamitos, CA, USA, 2000.
[75] B. Gretarsson, S. Bostandjiev, J. O'Donovan, and T. Hollerer. WiGis: A framework for scalable web-based interactive graph visualizations.
[76] R. Hastie and R.M. Dawes. Rational choice in an uncertain world. Journal of the Indian Academy of Applied Psychology, page 107, 2003.
[77] M. Hegarty. Dynamic visualizations and learning: Getting to the difficult questions. Learning and Instruction, 14(3):343–352, 2004.
[78] N. Henry, A. Bezerianos, and J.D. Fekete. Improving the readability of clustered social networks using node duplication. IEEE Transactions on Visualization and Computer Graphics, 14(6):1317–1324, 2008.
[79] R.J. Heuer. Psychology of Intelligence Analysis. United States Government Printing Office, 1999.
[80] H. Hochheiser, E.H. Baehrecke, S.M. Mount, and B. Shneiderman. Dynamic querying for pattern identification in microarray and genomic data. In Proceedings of the IEEE International Conference on Multimedia and Expo, volume 3, pages 453–456, 2003.
[81] D. Holten. Hierarchical edge bundles: Visualization of adjacency relations in hierarchical data. IEEE Transactions on Visualization and Computer Graphics, 12(5):741–748, 2006.
[82] Z. Hu, J. Mellor, J. Wu, T. Yamada, D. Holloway, and C. DeLisi. VisANT: data-integrating visual framework for biological networks and modules. Nucleic Acids Research, 33(Web Server Issue):W352, 2005.
[83] Visual Analytics Inc. Website. http://www.visualanalytics.com.
[84] P. Isenberg, A. Tang, and S. Carpendale. An exploratory study of visual information analysis. In Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, pages 1217–1226. ACM, 2008.
[85] R. Jianu, C. Demiralp, and D. Laidlaw. Exploring 3D DTI fiber tracts with linked 2D representations. IEEE TVCG (Proc. of Visualization), 15(6):1449–1456, 2009.
[86] R. Jianu, K. Yu, L. Cao, V. Nguyen, A.R. Salomon, and D.H. Laidlaw. Visual integration of quantitative proteomic data, pathways and protein interactions. IEEE Transactions on Visualization and Computer Graphics.
[87] Radu Jianu, Cagatay Demiralp, and David H. Laidlaw. Exploring brain connectivity with two-dimensional neural maps. In IEEE Visualization 2010 Poster Compendium, 2010.
[88] Radu Jianu, Cagatay Demiralp, and David H. Laidlaw. Exploring the brain connectivity with two-dimensional neural maps. IEEE Transactions on Visualization and Computer Graphics, 2010.
[89] Radu Jianu, Cagatay Demiralp, and David H. Laidlaw. Exploring the brain connectivity with two-dimensional neural maps. IEEE Visualization Poster Compendium, 2010.
[90] Radu Jianu, Cagatay Demiralp, and David H. Laidlaw. Visualizing and exploring tractograms via two-dimensional connectivity maps. In Proceedings of ISMRM '10, 2010.
[91] Radu Jianu and David H. Laidlaw. Visualizing gene co-expression as Google Maps. In ISVC Proceedings 2010, 2010.
[92] Radu Jianu and David H. Laidlaw. Visualizing protein interaction networks as Google Maps. In IEEE Visualization 2010 Poster Compendium, 2010.
[93] Radu Jianu and David H. Laidlaw. An evaluation of how small user interface changes can improve scientists' analytic strategies. In Proceedings of SIGCHI (CHI) 2012, 2012.
[94] Radu Jianu and David H. Laidlaw. Guiding visualization users towards improved analytic strategies using small interface changes. In IEEE Visualization 2011 Poster Compendium, 2011.
[95] Radu Jianu, David H. Laidlaw, and Arthur Salomon. Visualizing phosphorylation experiments data in the context of known protein interactions. IEEE Visualization Poster Compendium, 2006.
[96] D.W. Johnson and T.J. Jankun-Kelly. A scalability study of web-native information visualization. In Proceedings of Graphics Interface 2008, pages 163–168. Canadian Information Processing Society, Toronto, Canada, 2008.
[97] M. Jones. Thinker's Toolkit: 14 Powerful Techniques for Problem Solving, 1998.
[98] F. Jourdan and G. Melançon. Tool for metabolic and regulatory pathways visual analysis. In Proceedings of SPIE, volume 5009, page 46, 2003.
[99] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, 1990.
[100] D.A. Keim. Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics, 8(1):1–8, 2002.
[101] D.T. Kenrick, J.K. Maner, J. Butner, N.P. Li, D.V. Becker, and M. Schaller. Dynamical evolutionary psychology: Mapping the domains of the new interactionist paradigm. Personality and Social Psychology Review, 6(4):347, 2002.
[102] W.J. Kent, C.W. Sugnet, T.S. Furey, K.M. Roskin, T.H. Pringle, A.M. Zahler, et al. The human genome browser at UCSC. Genome Research, 12(6):996, 2002.
[103] A. Kittur, E.H. Chi, and B. Suh. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, pages 453–456. ACM, 2008.
[104] J. Klayman and Y.W. Ha. Confirmation, disconfirmation, and information in hypothesis testing. Psychological Review, 94(2):211–228, 1987.
[105] G.A. Klein. A recognition-primed decision (RPD) model of rapid decision making. Decision Making in Action: Models and Methods, pages 138–147, 1993.
[106] J. Klein, M. Scholl, A. Kohn, and H.K. Hahn. Real-time fiber selection using the Wii remote. In Proceedings of the SPIE, volume 7625, 2010.
[107] H. Kuehn, A. Liberzon, M. Reich, and J.P. Mesirov. Using GenePattern for gene expression analysis. Current Protocols in Bioinformatics, 2008.
[108] M. Kumar and T. Kim. Dynamic speedometer: dashboard redesign to discourage drivers from speeding. In CHI '05 Extended Abstracts on Human Factors in Computing Systems, page 1576. ACM, 2005.
[109] D. Lockton, D. Harrison, and N. Stanton. Design with intent: Persuasive technology in a wider context. Persuasive Technology, pages 274–278, 2008.
[110] A. Loomis and M. Watters. Multimodal volume visualization of geophysical data for archaeological analysis.
[111] C.G. Lord, L. Ross, and M.R. Lepper. Biased assimilation and attitude polarization: The effects of prior theories on subsequently considered evidence. Journal of Personality and Social Psychology, 37(11):2098–2109, 1979.
[112] M. Maddah, A.U.J. Mewes, S. Haker, W.E.L. Grimson, and S.K. Warfield. Automated atlas-based clustering of white matter fiber tracts from DT-MRI. Lecture Notes in Computer Science, 3749:188, 2005.
[113] M. Meyer, T. Munzner, and H. Pfister. MizBee: A multiscale synteny browser. IEEE Transactions on Visualization and Computer Graphics, 15(6):897–904, 2009.
[114] G. Michal. On representation of metabolic pathways. BioSystems, 47(1-2):1–7, 1998.
[115] MindManager. Website. http://www.mindjet.com.
[116] Bart Moberts, Anna Vilanova, and Jarke J. van Wijk. Evaluation of fiber clustering methods for diffusion tensor imaging. In Proceedings of Visualization '05, pages 65–72, 2005.
[117] S. Mori and P.C.M. van Zijl. Fiber tracking: principles and strategies – a technical review. NMR in Biomedicine, 15(7-8):468–480, 2002.
[118] S. Mori and P.C.M. van Zijl. Fiber tracking: principles and strategies – a technical review. NMR in Biomedicine, 15(7-8):468–480, 2002.
[119] A. Morrison and M. Chalmers. A pivot-based routine for improved parent-finding in hybrid MDS. Information Visualization, 3(2):109–122, 2004.
[120] T. Munzner. H3: Laying out large directed graphs in 3D hyperbolic space. In Proceedings of the IEEE Symposium on Information Visualization, pages 2–10, 1997.
[121] T. Munzner, F. Guimbretière, and G. Robertson. Constellation: a visualization tool for linguistic queries from MindNet. In Proceedings of the IEEE Symposium on Information Visualization (InfoVis '99), pages 132–135, 1999.
[122] Vinh Nguyen, Lulu Cao, Jonathan Lin, Anna Ritz, Norris Hung, Radu Jianu, Benjamin Raphael, David H. Laidlaw, Laurent Brossay, and Arthur Salomon. A new approach for quantitative phosphoproteomic dissection of signaling pathways applied to T cell receptor activation. Molecular and Cellular Proteomics, 8(11):2418–2431, 2009.
[123] C.L. North and B. Shneiderman. A taxonomy of multiple window coordinations. Human-Computer Interaction Laboratory, Institute for Advanced Computer Studies, 1997.
[124] i2 Analyst's Notebook. i2 Ltd. Website. http://www.i2.co.uk, viewed 2007.
[125] O. Nov. What motivates Wikipedians? Communications of the ACM, 50(11):64, 2007.
[126] H. Oinas-Kukkonen and M. Harjumaa. Towards deeper understanding of persuasion in software and information systems. In First International Conference on Advances in Computer-Human Interaction, pages 200–205. IEEE, 2008.
[127] R. Otten, A. Vilanova, and H. van de Wetering. Illustrative white matter fiber bundles. Computer Graphics Forum, 29(3):1013–1022, 2010.
[128] R. Otten, A. Vilanova, and H. van de Wetering. Illustrative white matter fiber bundles. In Computer Graphics Forum, volume 29, pages 1013–1022. Wiley Online Library, 2010.
[129] T. Pattison and M. Phillips. View coordination architecture for information visualisation. In Proceedings of the 2001 Asia-Pacific Symposium on Information Visualisation, pages 165–169. Australian Computer Society, Darlinghurst, Australia, 2001.
[130] F.V. Paulovich and R. Minghim. HiPP: A novel hierarchical point placement strategy and its application to the exploration of document collections. IEEE Transactions on Visualization and Computer Graphics, 14(6):1229–1236, 2008.
[131] S. Peri, J.D. Navarro, R. Amanchy, T.Z. Kristiansen, C.K. Jonnalagadda, V. Surendranath, V. Niranjan, B. Muthusamy, T.K.B. Gandhi, M. Gronborg, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research, 13(10):2363–2371, 2003.
[132] W.A. Pike, R. May, B. Baddeley, R. Riensche, J. Bruce, and K. Younkin. Scalable visual reasoning: supporting collaboration through distributed analysis. In Proceedings of the International Symposium on Collaborative Technologies and Systems (CTS 2007), pages 24–32, 2007.
[133] P. Pirolli and S. Card. The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis. In Proceedings of the International Conference on Intelligence Analysis, volume 2005, pages 2–4, 2005.
[134] M.D. Plumlee and C. Ware. Zooming versus multiple window interfaces: Cognitive costs of visual comparisons. ACM Transactions on Computer-Human Interaction (TOCHI), 13(2):179–209, 2006.
[135] Y. Qu and G.W. Furnas. Sources of structure in sensemaking. In CHI '05 Extended Abstracts on Human Factors in Computing Systems, page 1992. ACM, 2005.
[136] E.M. Reingold and J.S. Tilford. Tidier drawings of trees. IEEE Transactions on Software Engineering, pages 223–228, 1981.
[137] G.G. Robertson, J.D. Mackinlay, and S.K. Card. Cone trees: animated 3D visualizations of hierarchical information. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 189–194. ACM, New York, NY, USA, 1991.
[138] A.C. Robinson. Collaborative synthesis of visual analytic results. In IEEE Symposium on Visual Analytics Science and Technology (VAST '08), pages 67–74, 2008.
[139] R. Rodgers and J.E. Hunter. The discard of study evidence by literature reviewers. The Journal of Applied Behavioral Science, 30(3):329, 1994.
[140] D.M. Russell, M.J. Stefik, P. Pirolli, and S.K. Card. The cost structure of sensemaking. In Proceedings of the INTERACT '93 and CHI '93 Conference on Human Factors in Computing Systems, pages 269–276. ACM, 1993.
[141] P. Saraiya, C. North, and K. Duca. An insight-based methodology for evaluating bioinformatics visualizations. IEEE Transactions on Visualization and Computer Graphics, 11(4):443–456, 2005.
[142] M. Sarkar and M.H. Brown. Graphical fisheye views of graphs. Communications of the ACM, 37(12):73–84, 1994.
[143] Debra MacIvor Savage, Eric N. Wiebe, and Hugh A. Devine. Performance of 2D versus 3D topographic representations for different task types. In HFES Annual Meeting, 2004.
[144] A. Savikhin, R. Maciejewski, and D.S. Ebert. Applied visual analytics for economic decision-making. In IEEE Symposium on Visual Analytics Science and Technology (VAST '08), pages 107–114, 2008.
[145] C. Schmid and H. Hinterberger. Comparative multivariate visualization across conceptually different graphic displays. In Proceedings of the Seventh International Working Conference on Scientific and Statistical Database Management, pages 42–51, 1994.
[146] T. Schultz. Feature extraction for DW-MRI visualization: The state of the art and beyond. In Proceedings of the Schloss Dagstuhl Scientific Visualization Workshop 2009, 2010.
[147] J. Seo and B. Shneiderman. Interactively exploring hierarchical clustering results. Computer, pages 80–86, 2002.
[148] P. Shannon, A. Markiel, O. Ozier, N.S. Baliga, J.T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research, 13(11):2498, 2003.
[149] B. Shneiderman. Book preview: Designing the User Interface: Strategies for Effective Human-Computer Interaction. Interactions, 4(5):61, 1997.
[150] S.K. Card, J.D. Mackinlay, and B. Shneiderman. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann, 1999.
[151] H.A. Simon. Rationality as process and as product of thought. The American Economic Review, 68(2):1–16, 1978.
[152] A.U. Sinha and J. Meller. Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms. BMC Bioinformatics, 8(1):82, 2007.
[153] M.E. Skinner, A.V. Uzilov, L.D. Stein, C.J. Mungall, and I.H. Holmes. JBrowse: A next-generation genome browser. Genome Research, 19(9):1630, 2009.
[154] H.S. Smallman and M. Hegarty. Expertise, spatial ability and intuition in the use of complex visual displays. In Human Factors and Ergonomics Society Annual Meeting Proceedings, volume 51, pages 200–204. Human Factors and Ergonomics Society, 2007.
[155] R. Spence. Information Visualization. Addison-Wesley, Harlow, 2001.
[156] J. Stalker, B. Gibbins, P. Meidl, J. Smith, W. Spooner, H.R. Hotz, and A.V. Cox. The Ensembl Web site: mechanics of a genome browser. Genome Research, 14(5):951, 2004.
[157] C. Stangor, G.B. Sechrist, and J.T. Jost. Social influence and intergroup beliefs: The role of perceived social consensus. Social Influence: Direct and Indirect Processes, pages 235–252, 2001.
[158] J. Stasko, C. Gorg, and Z. Liu. Jigsaw: supporting investigative analysis through interactive visualization. Information Visualization, 7(2):118–132, 2008.
[159] C.R. Sunstein and R.H. Thaler. Libertarian paternalism is not an oxymoron. University of Chicago Law Review, 70:1159, 2003.
[160] T. DeRose, E.A. Bier, M. Stone, K. Pier, and W. Buxton. Toolglass and magic lenses: the see-through interface. In Proceedings of SIGGRAPH '93, pages 73–80, 1993.
[161] E. Tejada, R. Minghim, and L.G. Nonato. On improved projection techniques to support visual exploration of multi-dimensional data sets. Information Visualization, 2(4):218–231, 2003.
[162] S.T. Teoh and K.L. Ma. RINGS: A technique for visualizing large hierarchies. Lecture Notes in Computer Science, 2528:268–275, 2002.
[163] R.H. Thaler and S. Benartzi. Save More Tomorrow: using behavioral economics to increase employee saving. Journal of Political Economy, pages 164–187, 2004.
[164] R.H. Thaler and C.R. Sunstein. Nudge: Improving Decisions About Health, Wealth, and Happiness. Yale University Press, 2008.
[165] J.J. Thomas and K.A. Cook. Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE Computer Society, 2005.
[166] D. Tunkelang. A practical approach to drawing undirected graphs, 1994.
[167] F. van Ham and A. Perer. Search, show context, expand on demand: Supporting large graph exploration with degree-of-interest. IEEE Transactions on Visualization and Computer Graphics, 15(6):953–960, 2009.
[168] F.B. Viégas, M. Wattenberg, M. McKeon, F. van Ham, and J. Kriss. Harry Potter and the meat-filled freezer: A case study of spontaneous usage of visualization tools. In Proc. HICSS, 2008.
[169] F.B. Viégas, M. Wattenberg, F. van Ham, J. Kriss, and M. McKeon. ManyEyes: a site for visualization at internet scale. IEEE Transactions on Visualization and Computer Graphics, 13(6):1121, 2007.
[170] I. Viola, M.E. Groller, M. Hadwiger, K. Buhler, B. Preim, M.C. Sousa, D.S. Ebert, and D. Stredney. Illustrative visualization. In IEEE Visualization, page 124, 2005.
[171] C. von Mering, L.J. Jensen, B. Snel, S.D. Hooper, M. Krupp, M. Foglierini, N. Jouffre, M.A. Huynen, and P. Bork. STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Research, 33(Database Issue):D433, 2005.
[172] S. Wakana, H. Jiang, L.M. Nagae-Poetscher, P.C.M. van Zijl, and S. Mori. Fiber tract-based atlas of human white matter anatomy. Radiology, 2004.
[173] M.O. Ward. XmdvTool: integrating multiple methods for visualizing multivariate data. In Proceedings of IEEE Visualization '94, pages 326–333, 1994.
[174] P.C. Wason. Reasoning about a rule. The Quarterly Journal of Experimental Psychology, 20(3):273–281, 1968.
[175] W. Wright, D. Schroh, P. Proulx, A. Skaburskis, and B. Cort. The Sandbox for analysis: concepts and methods. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 801–810. ACM, New York, NY, USA, 2006.
[176] I. Xenarios, L. Salwinski, X.J. Duan, P. Higney, S.M. Kim, and D. Eisenberg. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research, 30(1):303, 2002.
[177] D. Yang, E.A. Rundensteiner, and M.O. Ward. Nugget discovery in visual exploration environments by query consolidation. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 603–612. ACM, New York, NY, USA, 2007.
[178] D. Yang, Z. Xie, E.A. Rundensteiner, and M.O. Ward. Managing discoveries in the visual analytics process.
[179] T. Yates, M.J. Okoniewski, and C.J. Miller. X:Map: annotation and visualization of genome structure for Affymetrix exon array analysis. Nucleic Acids Research, 36(Database Issue):D780, 2008.
[180] K. Yu, L. Cao, R. Jianu, R. Park, C. Gatsonis, D. Laidlaw, and A. Salomon. A software suite to expedite the study of cell signaling pathway: automated acquisition, organization and annotation. In 55th ASMS Conference Proceedings. American Society for Mass Spectrometry, 2007.
[181] A. Zanzoni, L. Montecchi-Palazzi, M. Quondam, G. Ausiello, M. Helmer-Citterich, and G. Cesareni. MINT: a Molecular INTeraction database. FEBS Letters, 513(1):135–140, 2002.
[182] S. Zhang, C. Demiralp, and D.H. Laidlaw. Visualizing diffusion tensor MR images using streamtubes and streamsurfaces. IEEE TVCG, 9(4):454–462, 2003.
[183] Experimental data. http://graphics.cs.brown.edu/research/sciviz/nudges/.