Abstract of “Algorithms for Analyzing Human Genome Rearrangements,” by Crystal L.
Kahn, Ph.D., Brown University, May 2011.


The human genome exhibits a rich structure resulting from a long history of genomic
changes, including single base-pair mutations and larger scale rearrangements such as in-
versions, deletions, translocations, and duplications. The number and order of the genomic
changes that resulted in the present-day human genome is not known, but can sometimes
be inferred by comparison to the genomes of other species. In particular, genome rear-
rangements are modeled as operations on signed strings of characters representing blocks
of conserved sequences. Genome rearrangement distance measures quantify the similarity
between two or more genome sequences by counting the minimum, or most likely, number
of rearrangement operations needed to transform one sequence into another. The devel-
opment of efficient algorithms for computing genome rearrangement distances has been
instrumental both in computing phylogenies for sets of known genetic sequences (such as
gene families or the whole genomes of present-day species) and in constructing ancestral
genome sequences.


In this thesis, we develop algorithms to study recent genome rearrangements in human and
cancer genomes. We introduce a novel measure, called duplication distance, to quantify
the similarity between two genomic regions containing segmental duplications. We give
an efficient algorithm to compute the duplication distance between a pair of signed strings
and provide several generalizations of duplication distance that also measure inversions and
deletions. We demonstrate the utility of the duplication distance measure in constructing
the evolutionary history of segmental duplications in the human genome using both parsi-
mony and likelihood techniques. Further, motivated by recent cancer genome sequencing
studies, we present a new algorithm for the block ordering problem of inferring a whole
genome sequence from a partial assembly by maximizing its similarity to another genome.
Algorithms for Analyzing Human Genome Rearrangements


                           by
                    Crystal L. Kahn
             B. A., Amherst College, 2004
            Sc. M., Brown University, 2008


   Submitted in partial fulfillment of the requirements
         for the degree of Doctor of Philosophy
in the Program in Computer Science at Brown University.


               Providence, Rhode Island
                       May 2011

c Copyright 2008, 2009, 2010 by Crystal L. Kahn
       This dissertation by Crystal L. Kahn is accepted in its present form by
   the Department of Computer Science as satisfying the dissertation requirement
                      for the degree of Doctor of Philosophy.


Date
                                        Benjamin J. Raphael, Director


                      Recommended to the Graduate Council


Date
                                               Sorin Istrail, Reader


Date
                                          Bernard Moret, Reader
                                 ´
                                (Ecole Polytechnique F´ed´erale de Lausanne)


Date
                                          Charles Lawrence, Reader
                                            Applied Mathematics


                        Approved by the Graduate Council


Date
                                                Peter Weber
                                        Dean of the Graduate School


                                         iii
                          Curriculum Vitae

Crystal Louise Kahn was born in Mansfield, Ohio on August 27, 1982, the best day of her
older brother’s life as yet. As the daughter of two political science professors fond of sab-
baticals, she traveled a lot and at age eight moved to Fairfield, CT. She was valedictorian of
Fairfield High School, a National Merit Scholar, and a Governor’s Scholar of CT in 2000.
From 2000 to 2004, she attended Amherst College in Amherst, MA where she earned a
Bachelor of Arts in Computer Science, magna cum laude and with distinction. She was
also awarded the Amherst College Computer Science Award and was inducted into the Phi
Beta Kappa honor society in 2004. She spent the year of 2004 to 2005 at the University
of Padova in Italy on a Fulbright Scholarship. In 2005, she began her graduate studies at
Brown University in Providence, RI. She earned her Masters in 2008 while working toward
a doctorate in Computer Science. While at Brown, she received a National Science Foun-
dation Graduate Research Fellowship and an Amherst College Memorial Fund Fellowship.

She currently lives in Cambridge, MA with her fiance who denies that he was once also a
computer scientist.


                                              iv
                         Acknowledgments

First, I would like to thank my advisor, Ben Raphael, for introducing me to computational
biology and for helping me to discover interesting problems involving the design of algo-
rithms. I would also like to thank the other members of my thesis committee. Sorin Istrail,
the fearless leader of the Center for Computational Molecular Biology at Brown, has been
a mentor and has introduced me to many luminaries in the field. Bernard Moret’s research
has influenced and inspired many contributions in this thesis. Chip Lawrence’s insightful
comments on my research were instructional and encouraged me to explore new directions.
Of course, I am also indebted to my student coauthors, Shay Mozes and Borislav Hristov,
for their contributions and energy.

Many members of the computer science community at Brown provided friendship and
moral support (and sometimes even interesting conversation related to computer science).
In particular, I’d like to thank Anna Ritz, Cora Borradaile, Claire Mathieu, Aparna Das,
Suzanne Sindi, and Fabio Vandin.

I am grateful for financial support from the National Science Foundation, Amherst College,
and Brown University.

Finally, I would like to acknowledge my family for their support, generosity, and patience.
I cannot thank my parents (Bev and Bob), my brother (Adam), and my fiance (Mike Valen-
tine) enough for letting me lean on them and for helping me to keep my eyes on the prize.


                                            v
                                    Contents

List of Figures                                                                            viii

1   INTRODUCTION                                                                             1
    1.1   Contributions and Organization . . . . . . . . . . . . . . . . . . . . . . . .     1
    1.2   Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     3
          1.2.1   Models of Genome Rearrangement . . . . . . . . . . . . . . . . .           4
          1.2.2   Multiple Genome Rearrangement Algorithms . . . . . . . . . . . .           7
          1.2.3   Analysis of Duplicated Genomic Regions . . . . . . . . . . . . . .         8

2   DUPLICATION DISTANCE: A COMBINATORIAL MODEL OF SEGMEN-
    TAL DUPLICATIONS                                                                        10
    2.1   The Basic Recurrence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
    2.2   Extending to Affine Duplication Cost . . . . . . . . . . . . . . . . . . . . 16
    2.3   Extending the Model: Duplication-Deletion Distance . . . . . . . . . . . . 17
    2.4   Extending the Model: Duplication-Inversion Distance . . . . . . . . . . . . 20
    2.5   Extending the Model: Duplication-Inversion-Deletion Distance . . . . . . . 22

3   ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS                                                25
    3.1   The Most-Parsimonious Two-Step Duplication Tree . . . . . . . . . . . . . 28
    3.2   The Max Parsimony and Max Likelihood Duplication History DAGs . . . . 36
          3.2.1   The Partition Function . . . . . . . . . . . . . . . . . . . . . . . . 37
          3.2.2   The Score Function . . . . . . . . . . . . . . . . . . . . . . . . . . 40
          3.2.3   Restricted Partition Function . . . . . . . . . . . . . . . . . . . . . 41
          3.2.4   Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 44
          3.2.5   Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
          3.2.6   Maximum Parsimony Reconstruction . . . . . . . . . . . . . . . . 47
          3.2.7   Maximum Likelihood Reconstruction . . . . . . . . . . . . . . . . 51
          3.2.8   The Simulated Annealing Heuristic . . . . . . . . . . . . . . . . . 52
    3.3   Conclusion and Future Directions . . . . . . . . . . . . . . . . . . . . . . 53


                                             vi
4   COMPLETING A PARTIALLY ASSEMBLED GENOME USING REARRANGE-
    MENT DISTANCE                                                                        55
    4.1   Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
    4.2   Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
    4.3   The Breakpoint Graph and the Adjacency Graph . . . . . . . . . . . . . . . 62
    4.4   Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
    4.5   The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
          4.5.1   The Unrestricted Problem . . . . . . . . . . . . . . . . . . . . . . 65
          4.5.2   The Restricted Problem . . . . . . . . . . . . . . . . . . . . . . . . 68
    4.6   Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

BIBLIOGRAPHY                                                                             79

A A DISCUSSION OF ANCESTRAL GENOME RECONSTRUCTION USING
    DUPLICATIONS AND REVERSALS                                                           86

B ANOTHER PROBABILISTIC MODEL OF SEGMENTAL DUPLICATIONS 90

C A DISCUSSION OF THE BLOCK ORDERING PROBLEM USING A BREAK-
    POINT GRAPH FRAMEWORK                                                                96


? Parts of this thesis have appeared before in [36, 37, 33, 34]. I thank my coauthors, Ben
Raphael, Shay Mozes, and Borislav Hristov, for their permission to use portions of these
papers in this thesis.


                                            vii
                          List of Figures

2.1   Overlapping subsequences . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2   A subsequence inside another . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3   A duplicate operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4   Case 1 in duplication distance computation . . . . . . . . . . . . . . . . . 13
2.5   Case 2 in duplication distance computation . . . . . . . . . . . . . . . . . 14
2.6   A delete operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1   Complex evolutionary relationships between duplication blocks containing
      segmental duplications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2   Duplication blocks in the present-day human genome . . . . . . . . . . . . 29
3.3   The two-step model of segmental duplication . . . . . . . . . . . . . . . . 30
3.4   Most-parsimonious two-step duplication tree for duplication blocks in the
      human genome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5   Two-step duplication tree: largest subtrees . . . . . . . . . . . . . . . . . . 35
3.6   A duplication scenario and feasible generator . . . . . . . . . . . . . . . . 37
3.7   Max parsimony evolutionary history DAG for duplication blocks in the
      human genome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.8   A connected component from max parsimony DAG . . . . . . . . . . . . . 47
3.9   Duplication block recombination . . . . . . . . . . . . . . . . . . . . . . . 48
3.10 Comparison of max parsimony and max likelihood DAGS for clade ‘chr16’             49
3.11 Comparison of max parsimony and max likelihood DAGS for clade ‘chr10’             50
3.12 Distribution of local optima in max parsimony computation for clade ‘chr‘6’ 53

4.1   A genome graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2   A partial genome graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3   An adjacency graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4   A partial adjacency graph . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5   A breakpoint graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.6   Illustration of Claim 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.7   A mixture tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78


                                         viii
B.1 A duplication scenario and feasible generator . . . . . . . . . . . . . . . . 91


                                       ix
                         INTRODUCTION

Genome rearrangement events are large changes that occur in a genome sequence as the
result of a chromosomal rupture. The human genome contains evidence of many rear-
rangements that have occurred over the course of evolution. We model a genome as a
signed string on a multiset of synteny blocks that represent contiguous DNA sequences,
corresponding to genes or other markers, that are conserved across genomes. Common
types of sequence rearrangement events that have been studied are insertions (that augment
a sequence by inserting a string of blocks in the middle of it), deletions (in which a sub-
string of blocks is deleted from the middle of a sequence), reversals (in which the order of
a contiguous substring of blocks is reversed), translocations (which are similar to cut-paste
operations within a sequence), and duplication transpositions (which are similar to copy-
paste operations within a sequence). Genome rearrangement distances count the minimum
number of operations required to transform one genome into another. Sorting genomes
by rearrangements is a related problem in which the transforming sequence of rearrange-
ments between a pair of genomes is reconstructed. There has been much work in designing
efficient algorithms for computing rearrangement distances and for sorting genomes by re-
arrangements. They have been used by evolutionary biologists to infer phylogenetic trees
on genomes for different species, to construct putative ancestral genome sequences, and to
assign orthologous genes in related species, among other applications. Genome rearrange-
ments can also be used to model somatic mutations that occur within a single tissue, such
as in a tumor mass.

1.1   Contributions and Organization

In this thesis, we present a number of techniques for analyzing rearrangements that oc-
curred recently in the evolution of the human genome or that occur as the result of somatic
mutations in cancer genomes. The study of genome rearrangements has resulted in a rich
literature of algorithms for computing distance metrics between pairs or sets of genomes.

                                             1
                                                                                            2

We contribute to this body of work by introducing new distance metrics between genomes
that contain repeated elements, a traditionally confounding type of comparison. Computa-
tional biologists have used models of rearrangement, or rearrangement distances, to com-
pute putative rearrangement histories for genomic sequences. For example, rearrangement
distances have been used to characterize paralogous gene families within a genome, iden-
tify orthologous genes in different species, compare whole-genome sequences of different
species, infer ancestral versions of present-day genomes, and study somatic mutations in
cancer genomes. We also use our novel rearrangement distances to compute putative re-
arrangement histories for regions of the human genome containing duplicates using both
parsimony and likelihood criteria. Finally, we discuss a problem in which a genome rear-
rangement distance is used to complete a partially assembled genome sequence by max-
imizing its similarity to another known genome sequence. We suggest a multi-genome
generalization of this problem be used in efforts to sequence genomes from a sample of
cancer cells.

This thesis is organized into four main parts. In the rest of this chapter, we discuss how our
work relates to several relevant results on genome rearrangement distances and algorithms.
In Chapter 2, we consider the problem of comparing genomes that contain duplicated re-
gions. In Chapter 3, we present problem formulations for computing duplication histories
for regions containing segmental duplications and compute putative histories for a set of
segmental duplications in the human genome. Finally, in Chapter 4, we give an algo-
rithm for inferring a whole-genome sequence from a set of partially assembled contigs by
maximizing similarity to a known reference genome. We also formulate the problem of
inferring the content of a mixture of genomes from a set of measured rearrangements, a
problem motivated by cancer sequencing studies. We summarize our main results here:

Duplication Distance We present a novel genome rearrangement distance, called dupli-
cation distance, that counts the number of duplicate operations needed to construct a given
target sequence by copying and pasting substrings of a fixed source sequence. We present
several generalizations of this basic duplication model that allow also for certain types of
reversal and deletion operations. We give efficient algorithms for computing duplication
distance and its variants.

Most-Parsimonious Two-Step Duplication Tree We introduce an integer linear program
formulation of the problem of computing a putative history for a set of genomic regions
                                                                                         3

containing duplicates that is based on a parsimony criterion. We use our problem formu-
lation to compute a putative two-generation evolutionary history for a set of segmental
duplications in the human genome using duplication distance as a measure of parsimony.

Maximum Parsimony Evolutionary History DAG We introduce an optimization prob-
lem for computing an evolutionary history for a set of genomic regions containing dupli-
cates where a history is represented as a directed acyclic graph (DAG). We use our problem
formulation to compute a putative evolutionary history DAG for a set of human segmental
duplications using duplication distance as a measure of parsimony.

Maximum Likelihood Evolutionary History DAG We develop a probabilistic model
of segmental duplication based on a partition function of the weighted ensemble of dupli-
cation scenarios. We give an efficient algorithm for computing the partition function of
an ensemble of duplication scenarios. We introduce a likelihood analog of the Maximum
Parsimony Evolutionary History DAG optimization problem using our probabilistic model
to compute likelihood scores. We use our problem formulation to compute a putative evo-
lutionary history DAG for a set of human segmental duplications, and we compare the
likelihood and parsimony solutions.

Completing a Partially Assembled Genome Using a Rearrangement Distance We give
a simple algorithm for the block ordering problem that constructs a genome assembly from
a set of partially assembled contigs so as to maximize the similarity between the resulting
genome and a known reference genome. We present a linear-time algorithm for the prob-
lem when the measure of similarity is defined as the double-cut-and-join (DCJ) distance
between a pair of genomes. We further provide a proof that given a pair of genomes, the
number of cycles in their breakpoint graph is equal to the number of cycles in their adja-
cency graph. Finally, we suggest the problem of computing a most-parsimonious set of k
genomes that collectively exhibit a set of measured rearrangements, motivated by recent
paired-end sequencing studies of cancer genomes.

1.2   Related Work

Although the work presented in this doctoral thesis falls into the broad category of al-
gorithms for analyzing genome rearrangements, the techniques and models employed are
varied and draw inspiration from many seminal works. Our work includes both results in
the design of rearrangement models and algorithms for computing rearrangement distances
and results in the computational analysis of genetic data from the human genome. Here we
                                                                                           4

place our work into context with other related research efforts by other members of the
community.

1.2.1   Models of Genome Rearrangement

Genomes evolve via many types of mutations ranging in scale from single nucleotide mu-
tations to large genome rearrangements. In this thesis, we consider several types of large-
scale genome rearrangements that are caused by chromosomal ruptures. Computational
models of these mutational processes allow researchers to derive similarity measures, called
rearrangement distances, between genome sequences and to reconstruct evolutionary re-
lationships between genomes. For example, considering substring reversals as the only
type of mutation leads to the so-called reversal distance problem of finding the minimum
number of reversals that transform one genome into another [58, 53]. Developing genome
rearrangement models that are both biologically realistic and computationally tractable re-
mains an active area of research.

Traditionally, computational biologists model a genome as a string on an alphabet of syn-
teny blocks that may represent genes or other genomic sequences that are conserved (either
across multiple loci in a single genome or across multiple genomes exhibiting ortholo-
gous copies of the same sequence). In the literature, a genome that contains some synteny
block in duplicate is ambiguous and one without duplicates (i.e. a permutation) is non-
ambiguous. The first results in the area of genome rearrangement distances dealt with dis-
tances between non-ambiguous genomes. An early breakthrough in the study of genome
rearrangements is due to Hannenhalli and Pevzner [30] who introduced a polynomial-time
algorithm for computing the reversal distance between signed permutations (where every
integer has a +/- orientation) in which the only rearrangement event considered is a rever-
sal. For example, given a signed string X = +1 +2 +3 +4, a reversal of the substring,
+3 +4, yields X 0 = +1 +2 −4 −3. This result was later improved in [10] and then again
in [4]. (For discussion see cf. [50] and references therein).

Several elegant extensions of the reversal distance model have also been considered. For
example, [23] extends the theory of [30] to compute the distance between a pair of genomes
that may not necessarily contain the same set of blocks by allowing insertions and deletions
of substrings. For instance, the reversal distance between X1 = +1 +2 +3 +4 and X2 =
+5 −3 −2 +6 is undefined but [23] computes a reversal distance that also allows operations
that delete the blocks that only appear in one of the input genomes (i.e. blocks 1, 4, 5, and
                                                                                           5

6). This was later improved by [41].

Ambiguous genomes present a particular challenge for genome rearrangement analysis and
often make the underlying computational problems more difficult. For instance, computing
reversal distance in signed genomes with duplicates is NP-hard [19].

There have been, however, several efficient solutions given to problems involving genomes
with duplicates. For example, the genome halving problem has been solved. The input
to the problem is a genome with exactly two copies of every character and the goal is
to construct a minimum sequence of reversals that transforms the input genome into any
doubled genome equal to the concatenation of two identical non-ambiguous genomes. This
problem was first explored by [25] who gave a solution that was later discovered to be
incomplete and was corrected in [1].

Another type of rearrangement that has been used to compare ambiguous genomes is the
tandem duplication model in which an operation copies a substring of the genome and
reinserts it into the genome right next to itself. For example, [18] presented the tandem-
duplication random-loss model. In one operation, a substring of the genome is duplicated
and inserted right next to itself and then exactly one copy of each of the newly duplicated
integers is deleted. [18] gives exact, polynomial-time algorithms for special cost functions.
In [11], the authors consider tandem duplications in the context of inferring a most par-
simonious sequence of tandem duplication, gene loss, speciation, and reversal events that
is consistent with a given gene tree and such that the total number of reversals is mini-
mized. In [26], the authors consider the problem of constructing the duplication history
(i.e. phylogenetic tree) for a set of tandemly repeated genes. In a duplication tree, each
leaf of the tree corresponds to one of the present-day paralogous genes, and each internal
node corresponds to the duplication of either a single gene or a set of adjacent genes. They
give a simple method for determining whether a given rooted phylogeny is also a partially
ordered duplication history (i.e. agrees with the order of the genes). They also give an
exhaustive search method for finding the max parsimony duplication history. In [3], the
authors consider tandem, alpha-satellite repeats in the human genome. They construct a
probabilistic framework for evaluating the likelihood that a particular set of tandem repeats
evolved by the physical process of unequal crossover. Finally, in [20], the authors give
a polynomial-time approximation scheme (PTAS) for computing the optimal history (i.e.
duplication tree) of tandem duplications for a given ambiguous genome where nodes may
correspond to tandem duplications of contiguous substrings, and the cost on an edge is the
                                                                                            6

hamming distance between the two sequences at the endpoints.

There have also been many proposed models that compute the distance between a non-
ambiguous ancestral genome and a present-day ambiguous genome. In these models, the
sequence of transforming rearrangements must include operations that introduce arbitrary
duplicates into the genome. However, efficient algorithms to compute these distances ex-
actly are largely unknown.

For example, [24] gives a method for computing the minimum number of duplication
transpositions and reversals needed to transform any non-ambiguous ancestor into a given
present-day, ambiguous genome. The method is not efficient unless the present-day genome
contains no more than two copies of any duplicate, and even in this case, the algorithm pre-
sented in [24] is flawed. (See Appendix A for a discussion.) In [43], the authors give an
efficient approximation algorithm that computes the distance between the identity permu-
tation and an arbitrary (possibly ambiguous) genome under reversals, deletions, and dupli-
cating insertions. This work is extended by [56] who compute the distance between two
arbitrary (possibly ambiguous) genomes approximately. Unfortunately, no exact solution
for this problem is known. An exact algorithm to compute a minimum distance between
an arbitrary (ambiguous) genome and some non-ambiguous ancestral genome under dupli-
cation transpositions is given in [60], but the duplication transposition model relies on the
simplifying no-breakpoint-reuse assumption allowing a simple greedy method to suffice.

In Chapter 2, we discuss a novel rearrangement distance, called duplication distance, in-
troduced in [36], that models the duplication and transposition of contiguous genomic sub-
strings en bloc between disparate loci. The duplication distance from a source string x
to a target string y is the minimum number of substrings of x that can be sequentially
copied from x and pasted into an initially empty string in order to construct y. We de-
rive an efficient exact algorithm for computing the duplication distance between a pair of
strings. Note that the string x does not change during the sequence of duplication events.
Moreover, duplication distance does not model local rearrangements, like tandem duplica-
tions, deletions or inversions, that occur within a duplication block during its construction.
While such local rearrangements undoubtedly occur in genome evolution, the duplication
distance model focuses on identifying the duplicate operations that account for the con-
struction of repeated patterns within duplication blocks by aggregating substrings of other
duplication blocks from different loci. Thus, like nearly every other genome rearrange-
ment model, the duplication distance model makes some simplifying assumptions about
                                                                                               7

the underlying biology to achieve computational tractability. In [33, 35], we extended the
duplication distance measure to include certain types of deletions and inversions, and we
give polynomial-time exact algorithms for computing these extensions. The reversals we
consider only occur within a particular duplicated segment of the source string before being
inserted into the target; we do not allow arbitrary reversals to occur in the target string. The
deletions we consider, however, are arbitrary substring deletions that can occur in the tar-
get string at any time during a sequence of operations. These extensions make our model
less restrictive – although we still maintain the restriction that x does not change during
the sequence of duplications – and allows the construction of more rich, and perhaps more
biologically plausible, duplication scenarios. While not explicitly modeling every type of
rearrangement that might occur within a sequence of operations that builds a target string,
duplication distance (and its extensions) provide an approximation of how a sequence of
operations might occur and is efficiently computable in polynomial time.

Moreover, the abstraction we make by distinguishing the fixed source string from the
changing target string is inspired by a known biological process by which mosaic patterns
of segmental duplications are composed within mammalian genomes. Thus, the source
and target strings represent two distinct genomic regions that might possibly be on differ-
ent chromosomes. This process, known as the two-step model of segmental duplication is
discussed in greater detail in Chapter 3.

1.2.2   Multiple Genome Rearrangement Algorithms

Computing rearrangement histories for a set of more than two genomes is used in construct-
ing phylogenies of species and identifying orthologous genes in different species among
other applications. The simplest problem to define on a set of multiple genomes is the me-
dian problem with respect to a certain rearrangement distance. Given a set {G1 , G2 , . . . , Gk }
of genomes, the median problem is to find a genome H that minimizes ki=1 d(Gi , H)
                                                                            P

where d is some distance measure. The median problem has been shown to be NP-hard
with respect to reversal distance on signed permutations [17] and with respect to the sim-
pler breakpoint distance on both signed and unsigned permutations [49] where the break-
point distance between a pair of genomes is the number of character adjacencies that are
exhibited in one genome and not the other. A heuristic for computing a median permutation
with respect to breakpoint distance has been given by [55] and an approximation algorithm
for computing the median problem with respect to a special case of the tandem-duplication
random-loss distance was given by [18].
                                                                                            8

The problem of constructing a phylogenetic tree to represent a rearrangement history for
a set of known genomes of common ancestry has been well-studied. In the phylogenetic
tree problem, the leaves of the tree are the set of known genomes. For example, [12]
describes a heuristic (BPAnalysis) for computing the unknown ancestral genomes in a fixed
phylogenetic tree with a breakpoint distance criterion. The method is exponential in both
the number of genomes and the size of the genomes. In [22], the authors improve the
method presented in [12]; their method is only exponential in the number of genomes. This
was improved further by [46] in a tool called GRAPPA that also computed phylogenies
with respect to reversal distance. This was then refined by [47] who also increased the
speed by a factor of one million. In [16], the authors introduce a new heuristic (MGR)
for computing phylogenies with respect to reversal distance that is shown to perform better
than the method given in [47] on real data.

1.2.3   Analysis of Duplicated Genomic Regions

Computational biologists use genome rearrangement distances to infer evolutionary rela-
tionships between species or to infer the history of genomic regions of interest. Many
computational biologists are interested in reconstructing the histories of regions containing
duplicated segments. For example, in [8], the authors construct a phylogeny for a set of
species using regions containing orthologous repeats under the “no homoplasy” assump-
tion.

The Alu family of repetitive elements has been studied in detail as certain Alu insertions
or mutations have been linked to several human diseases. In [45], the authors do a limited
Alu phylogeny reconstruction by recursively computing a maximum likelihood partition of
the elements. In [52], the authors partition Alu repeats into 213 subfamilies by recursively
splitting subfamilies whose members fail a statistical uniformity test. They look for pairs
of non-consensus nucleotide values at distinct positions. This allows them to find nested
subfamilies (which is impossible using the method of [45]). Finally, they build an evolu-
tionary tree of the subfamilies by computing a minimum spanning tree (MST) with respect
to the Hamming distance between subfamily consensus sequences.

In [48], the authors present a randomized method for computing the most likely phylogeny
of large sets (∼1,000,000) of mobile elements. The authors assume that only a small num-
ber of the elements actively replicate and that all the resultant copies are highly similar on
the sequence level and, therefore, elude distance-based clustering methods. Their method
                                                                                             9

partitions the elements using a randomized clustering algorithm (not based on EM due to
its slow convergence). It is an extension of the method presented in [52]; the recursive
splitting into subfamilies is done by randomly testing pairs of positions for correlation (in
lieu of exhaustive testing), requiring only time that is linear in the repeat sequence length.

In Chapter 3, we use duplication distance to analyze a set of regions of the human genome
that contain segmental duplications. First, we find the most parsimonious duplication sce-
nario consistent with the so-called two-step model of segmental duplication using dupli-
cation distance as our measure of parsimony. We then refine our notion of a duplication
history and compute duplication history DAGs for the regions containing segmental dupli-
cations using both a parsimony and a likelihood criterion.
            DUPLICATION DISTANCE: A
         COMBINATORIAL MODEL OF
          SEGMENTAL DUPLICATIONS

In this chapter, we introduce a novel measure of similarity between genomic regions con-
taining repeated elements, duplication distance. We begin by reviewing some definitions
and notation that were introduced in [36] and [37]. Let ∅ denote the empty string. For a
string x = x1 . . . xn , let xi,j denote the substring xi xi+1 . . . xj . We define a subsequence
S of x to be a string xi1 xi2 . . . xik with i1 < i2 < · · · < ik . We represent S by listing
the indices at which the characters of S occur in x. For example, if x = abcdef , then the
subsequence S = (1, 3, 5) is the string ace. Note that every substring is a subsequence,
but a subsequence need not be a substring since the characters comprising a subsequence
need not be contiguous. For a pair of subsequences S1 , S2 , denote by S1 ∩ S2 the maximal
subsequence common to both S1 and S2 .
Definition 1. Subsequences S = (s1 , s2 ) and T = (t1 , t2 ) of a string x are alternating in x
if either s1 < t1 < s2 < t2 or t1 < s1 < t2 < s2 .
Definition 2. Subsequences S = (s1 , . . . , sk ) and T = (t1 , . . . , tl ) of a string x are over-
lapping in x if there exist indices i, i0 and j, j 0 such that 1 ≤ i < i0 ≤ k, 1 ≤ j < j 0 ≤ l,
and (si , si0 ) and (tj , tj 0 ) are alternating in x. See Fig. 2.1.
Definition 3. Given subsequences S = (s1 , . . . , sk ) and T = (t1 , . . . , tl ) of a string x, S is
inside of T if there exists an index i such that 1 ≤ i < l and ti < s1 < sk < ti+1 . That is,
the entire subsequence S occurs in between successive characters of T . See Fig. 2.2.
Definition 4. A duplicate operation from x, δx (s, t, p), copies a substring xs . . . xt of the
source string x and pastes it into a target string at position p. Specifically, if x = x1 . . . xm
and Z = z1 . . . zn , then
Z ◦ δx (s, t, p) = z1 . . . zp−1 xs . . . xt zp . . . zn . See Fig. 2.3.

                                                       10
                                                                                                                 11

                                                                                  ti                ti+1
                  si        tj       si'       tj'                   X
         X

                                                                     Figure 2.2: The red subsequence is inside the
Figure 2.1: The red subsequence is overlap-                          blue subsequence T . All the characters of the
ping with the blue subsequence in x. The in-                         red subsequence occur between the indices ti
dices (si , si0 ) and (tj , tj 0 ) are alternating in x.             and ti+1 of T .

                                           s           t
                                 X


                                                     Z ° !x(s,t,p)
                                 Z
                                           p                                           p


Figure 2.3: A duplicate operation, δx (s, t, p). A substring xs xs+1 . . . xt of the source string x is
copied and inserted into the target string Z at index p.


Definition 5. The duplication distance from a source string x to a target string y is the
minimum number of duplicate operations from x that generates y from an initially empty
target string 1 . That is, y = ∅ ◦ δx (s1 , t1 , p1 ) ◦ δx (s2 , t2 , p2 ) ◦ · · · ◦ δx (sl , tl , pl ).

2.1          The Basic Recurrence

In this section we review the basic recurrence for computing duplication distance that was
introduced in [37]. The recurrence examines the characters of the target string, y, and
considers the sets of characters of y that could have been generated, or copied from the
source string in a single duplicate operation. Such a set of characters of y necessarily
correspond to a substring of the source x (see Def. 4). Moreover, these characters must
be a subsequence of y. This is because, in a sequence of duplicate operations, once a
string is copied and inserted into the target string, subsequent duplicate operations do not
affect the order of the characters in the previously inserted string. Because every character
of y is generated by exactly one duplicate operation, a sequence of duplicate operations
that generates y partitions the characters of y into disjoint subsequences, each of which
is generated in a single duplicate operation. A more interesting observation is that these
subsequences are mutually non-overlapping. We formalize this property as follows.
Lemma 1 (Non-overlapping Property). Consider a source string x and a sequence of du-
plicate operations of the form δx (si , ti , pi ) that generates the final target string y from an
   1
       We assume that every character in y appears at least once in x.
                                                                                                    12

initially empty target string. The substrings xsi ,ti of x that are duplicated during the con-
struction of y appear as mutually non-overlapping subsequences of y.

Proof: Consider a sequence of duplicate operations δx (s1 , t1 , p1 ), . . . , δx (sk , tk , pk ) that
generates y from an initially empty target string. For 1 ≤ i ≤ k, Let Z i be the intermediate
target string that results from δx (s1 , t1 , p1 ) ◦ · · · ◦ δx (si , ti , pi ). Note that Z k = y. For
j ≤ i, let Sji be the subsequence of Z i that corresponds to the characters duplicated by
the j th operation. We shall show by induction on the length i of the sequence that that
S1i , S2i , . . . , Sii are pairwise non-overlapping subsequences of Z i . For the base case, when
there is a single duplicate operation, there is no non-overlap property to show. Assume now
that S1i−1 , . . . Si−1
                    i−1
                        are mutually non-overlapping subsequences in Z i−1 . For the induction
step note that, by the definition of a duplicate operation, Si is inserted as a contiguous
substring into Z i−1 at location pi to form Z i . Therefore, for any j, j 0 < i, if Sji−1 and
Sji−1
   0  are non overlapping in Z i−1 then Sji and Sji0 are non overlapping in Z i . It remains
to show that for any j < i Sji and Sii are non-overlapping in Z i . There are two cases: (1)
the elements of Sji are either all smaller or all greater than the elements of Sii or (2) Sii is
inside of Sji in Z i (Definition 3). In either case, Sj and Si are not overlapping in Z i as
required.                                                                                           

The non-overlapping property leads to an efficient recurrence that computes duplication
distance. When considering subsequences of the final target string y that might have been
generated in a single duplicate operation, we rely on the non-overlapping property to iden-
tify substrings of y that can be treated as independent subproblems. If we assume that some
subsequence S of y is produced in a single duplicate operation, then we know that all other
subsequences of y that correspond to duplicate operations cannot overlap the characters in
S. Therefore, the substrings of y in between successive characters of S define subproblems
that are computed independently.

In order to find the optimal (i.e. minimum) sequence of duplicate operations that generate
y, we must consider all subsequences of y that could have been generated by a single
duplicate operation. The recurrence is based on the observation that y1 must be the first
(i.e. leftmost) character to be copied from x in some duplicate operation. There are then
two cases to consider: either (1) y1 was the last (or rightmost) character in the substring that
was duplicated from x to generate y1 , or (2) y1 was not the last character in the substring
that was duplicated from x to generate y1 .

The recurrence defines two quantities: d(x, y) and di (x, y). We shall show, by induction,
                                                                                                     13

                                                i
                                    X

                                    Y
                                          1


                                    X

                                    Y

 Figure 2.4: y1 is generated from xi in a duplicate operation where y1 is the last (rightmost) character
 in the copied substring (Case 1). The total duplication distance is one plus the duplication distance
 for the suffix y2,|y| .


 that for a pair of strings, x and y, the value d(x, y) is equal to the duplication distance from
 x to y and that di (x, y) is equal to the duplication distance from x to y under the restriction
 that the character y1 is copied from index i in x, i.e. xi generates y1 . d(x, y) is found by
 considering the minimum among all characters xi of x that can generate y1 , see Eq. 2.1.

 As described above, we must consider two possibilities in order to compute di (x, y). Either:

Case 1 : y1 was the last (or rightmost) character in the substring of x that was copied to
        produce y1 , (see Fig. 2.4), or

Case 2 : xi+1 is also copied in the same duplicate operation as xi , possibly along with other
        characters as well (see Fig. 2.5).

 For case one, the minimum number of duplicate operations is one – for the duplicate that
 generates y1 – plus the minimum number of duplicate operations to generate the suffix
 of y, giving a total of 1 + d(x, y2,|y| ) (Fig. 4). For case two, Lemma 1 implies that the
 minimum number of duplicate operations is the sum of the optimal numbers of operations
 for two independent subproblems. Specifically, for each j > 1 such that xi+1 = yj we
 compute: (i) the minimum number of duplicate operations needed to build the substring
 y2,j−1 , namely d(x, y2,j−1 ), and (ii) the minimum number of duplicate operations needed
 to build the string y1 yj,|y| , given that y1 is generated by xi and yj is generated by xi+1 . To
 compute the latter, recall that since xi and xi+1 are copied in the same duplicate operation,
 the number of duplicates necessary to generate y1 yj,|y| using xi and xi+1 is equal to the
 number of duplicates necessary to generate yj,|y| using xi+1 , namely di+1 (x, yj,|y| ), (see
                                                                                                    14

Fig. 2.5 and Eq. 2.2).

The recurrence is, therefore:


            d(x, ∅) = 0
            d(x, y) =        min di (x, y)                                                       (2.1)
                           {i:xi =y1 }

           di (x, ∅) = 0
                                 (
                                     1 + d(x, y2,|y| )
           di (x, y) = min                                                                       (2.2)
                                     min{j:yj =xi+1 ,j>1} {d(x, y2,j−1 ) + di+1 (x, yj,|y| )}

Theorem 1. d(x, y) is the minimum number of duplicate operations that generate y from x.
For {i : xi = y1 }, di (x, y) is the minimum number of duplicate operations that generate y
from x such that y1 is generated by xi .

Proof: Let OP T (x, y) denote minimum length of a sequence of duplicate operations that
generate y from x. Let OP Ti (x, y) denote the minimum length of a sequence of operations
that generate y from x such that y1 is generated by xi . We prove by induction on | y | that
d(x, y) = OP T (x, y) and di (x, y) = OP Ti (x, y).

For | y |= 1, since we assume there is at least one i for which xi = y1 , OP T (x, y) =
OP Ti (x, y) = 1. By definition, the recurrence also evaluates to 1. For the inductive step,

                                                  i i+1
                                         X

                                         Y
                                             1       j


                                                                         i+1
             X                                            X

             Y                                            Y
                                                                          j


Figure 2.5: y1 is generated from xi in a duplicate operation where y1 is not the last (rightmost)
character in a copied substring (Case 2). In this case, xi+1 is also copied in the same duplicate
operation (top). Thus, the duplication distance is the sum of d(x, y2,j−1 ), the duplication distance
for y2,j−1 (bottom left), and di+1 (x, yj,|y| ), the minimum number of duplicate operations to generate
yj,|y| given that xi+1 generates yj (bottom right).
                                                                                               15

assume that OP T (x, y0 ) = d(x, y0 ) and OP Ti (x, y0 ) = di (x, y0 ) for any string y0 shorter
than y. We first show that OP Ti (x, y) ≤ di (x, y). Since OP T (x, y) = mini OP Ti (x, y),
this also implies OP T (x, y) ≤ d(x, y). We describe different sequences of duplicate oper-
ations that generate y from x, using xi to generate y1 :

   • Consider a minimum-length sequence of duplicates that generates y2,|y| . By the in-
      ductive hypothesis its length is d(x, y2,|y| ). By duplicating y1 separately using xi we
      obtain a sequence of duplicates that generates y whose length is 1 + d(x, y2,|y| ).

   • For every {j : yj = xi+1 , j > 1} consider a minimum-length sequence of dupli-
      cates that generates yj,|y| using xi+1 to produce yj , and a minimum-length sequence
      of duplicates that generates y2,j−1 . By the inductive hypothesis their lengths are
      di+1 (x, yj,|y| ) and d(x, y2,j−1 ) respectively. By extending the start index s of the du-
      plicate operation that starts with xi+1 to produce yj to start with xi and produce y1 as
      well, we produce y with the same number of duplicate operations.

Since OP Ti (x, y) is at most the length of any of these options, it is also at most their
minimum. Hence,
                                (
                                    1 + d(x, y2,|y| )
        OP Ti (x, y) ≤ min
                                    min{j:yj =xi+1 ,j>1} {d(x, y2,j−1 ) + di+1 (x, yj,|y| )}
                        = di (x, y).


To show the other direction (i.e. that d(x, y) ≤ OP T (x, y) and di (x, y) ≤ OP Ti (x, y)),
consider a minimum-length sequence of duplicate operations that generate y from x, using
xi to generate y1 . There are a few cases:

   • If y1 is generated by a duplicate operation that only duplicates xi , then OP Ti (x, y) =
      1 + OP T (x, y2,|y| ). By the inductive hypothesis this equals 1 + d(x, y2,|y| ) which is
      at least di (x, y).

   • Otherwise, y1 is generated by a duplicate operation that copies xi and also duplicates
      xi+1 to generate some character yj . In this case the sequence ∆ of duplicates that
      generates y2,j−1 must appear after the duplicate operation that generates y1 and yj
      because y2,j−1 is inside (Definition 3) of (y1 , yj ). Without loss of generality, sup-
      pose ∆ is ordered after all the other duplicates so that first y1 yj . . . y|y| is gener-
      ated, and then ∆ generates y2 . . . yj−1 between y1 and yj . Hence, OP Ti (x, y) =
      OP Ti (x, y1 yj,|y| ) + OP T (x, y2,j−1 ). Since in the optimal sequence xi generates y1 in
                                                                                                 16

      the same duplicate operation that generates yj from xi+1 , we have
      OP Ti (x, y1 yj,|y| ) = OP Ti+1 (x, yj,|y| ). By the inductive hypothesis,
      OP T (x, y2,j−1 ) + OP Ti+1 (x, yj,|y| ) = d(x, y2,j−1 ) + di+1 (x, yj,|y| ) which is at least
      di (x, y).

                                                                                                 
This recurrence naturally translates into a dynamic programing algorithm that computes
the values of d(x, ·) and di (x, ·) for various target strings. To analyze the running time of
this algorithm, note that both y2,j and yj,|y| are substrings of y. Since the set of substrings
of y is closed under taking substrings, we only encounter substrings of y. Also note that
since i is chosen from the set {i : xi = y1 }, there are O(µ(x)) choices for i, where µ(x) is
the maximal multiplicity of a character in x. Thus, there are O(µ(x) | y |2 ) different values
to compute. Each value is computed by considering the minimization over at most µ(y)
previously computed values, so the total running time is bounded by O(| y |2 µ(x)µ(y)),
which is O(| y |3 | x |) in the worst case. We note that for applications where the size of
the alphabet on which the strings are built is large with respect to the length of the strings,
such that µ(x) ∈ O(1) and µ(y) ∈ O(1), the running time of the algorithm is O(| y |2 )
in the worst case. As with most dynamic programming approaches, this algorithm (and all
others presented in subsequent sections) can be extended through trace-back to reconstruct
the optimal sequence of operations needed to build y. We omit the details.

2.2    Extending to Affine Duplication Cost

It is easy to extend the recurrence relations in Eqs. (2.1), (2.2) to handle costs for duplicate
operations. In the above discussion, the cost of each duplicate operation is 1, so the sum
of costs of the operations in a sequence that generates a string y is just the length of that
sequence. We next consider a more general cost model for duplication in which the cost of
a duplicate operation δx (s, t, p) is ∆1 + (t − s + 1)∆2 (i.e., the cost is affine in the number
of duplicated characters). Here ∆1 , ∆2 are some non-negative constants. This extension is
obtained by assigning a cost of ∆2 to each duplicated character, except for the last character
in the duplicated string, which is assigned a cost of ∆1 + ∆2 . We do that by adding a cost
term to each of the cases in Eq. 2.2. If xi is the last character in the duplicated string (case
1), we add ∆1 + ∆2 to the cost. Otherwise xi is not the last duplicated character (case 2),
                                                                                                        17

so we add just ∆2 to the cost. Eq. (2.2) thus becomes
                          (
                             ∆1 + ∆2 + d(x, y2,|y| )
       di (x, y) = min                                                                               (2.3)
                             min{j:yj =xi+1 ,j>1} {d(x, y2,j−1 ) + di+1 (x, yj,|y| ) + ∆2 }

The running time analysis for this recurrence is the same as for the one with unit duplication
cost.

2.3     Extending the Model: Duplication-Deletion Distance

Here we provide several extensions to the duplication distance model; we generalize the
model to allow also for certain types of substring deletions in the target string.

Consider the intermediate string Z generated after some number of duplicate operations. A
deletion operation removes a contiguous substring zi , . . . , zj of Z, and subsequent duplicate
and deletion operations are applied to the resulting string.
Definition 6. A delete operation, τ (s, t), deletes a substring zs . . . zt of the target string
Z, thus making Z shorter. Specifically, if Z = z1 . . . zs . . . zt . . . zm , then Z ◦ τ (s, t) =
z1 . . . zs−1 zt+1 . . . zm . See Figure 6.

The cost associated with τ (s, t) depends on the number t − s + 1 of characters deleted and
is denoted Φ(t − s + 1).

                                     s        t             Z ° !(s,t)
                            Z


                Figure 2.6: A delete operation, τ (s, t). The substring Zs,t is deleted.

Definition 7. The duplication-deletion distance from a source string x to a target string y
is the cost of a minimum sequence of duplicate operations from x and deletion operations,
in any order, that generates y.

We now show that although we allow arbitrary deletions from the intermediate string, it
suffices to consider deletions from the duplicated strings before they are pasted into the
intermediate string, provided that the cost function for deletion, Φ(·) is non-decreasing and
obeys the triangle inequality.
Definition 8. A duplicate-delete operation from x, ηx (i1 , j1 , i2 , j2 , . . . , ik , jk , p), for i1 ≤
j1 < i2 ≤ j2 < · · · < ik ≤ jk copies the subsequence xi1 . . . xj1 xi2 . . . xj2 . . . . . . xik . . . xjk
of the source string x and pastes it into a target string at position p. Specifically, if
x = x1 . . . xm and Z = z1 . . . zn , then Z ◦ ηx (i1 , j1 , . . . , ik , jk , p) =
                                                                                                 18

z1 . . . zp−1 xi1 . . . xj1 xi2 . . . xj2 . . . xik . . . xjk zp . . . zn .


The cost associated with such a duplicate-delete is ∆1 +(jk −i1 +1)∆2 +                 Φ(i`+1 −j` −
                                                                               Pk−1
                                                                                  `=1
1). The first two terms in the cost reflect the affine cost of duplicating an entire substring of
length jk − i1 + 1, and the second term reflects the cost of deletions made to that substrings.
Lemma 2. If the affine cost for duplications is non-decreasing and Φ(·) is non-decreasing
and obeys the triangle inequality then the cost of a minimum sequence of duplicate and
delete operations that generates a target string y from a source string x is equal to the cost
of a minimum sequence of duplicate-delete operations that generates y from x.

Proof: Since duplicate operations are a special case of duplicate-delete operations, the cost
of a minimal sequence of duplicate-delete operations and delete operations that generates
y cannot be more than that of a sequence of just duplicate operations and delete operations.
We show the (stronger) claim that an arbitrary sequence of duplicate-delete and delete op-
erations that produces a string y with cost c can be transformed into a sequence of just
duplicate-delete operations that generates y with cost at most c by induction on the num-
ber of delete operations. The base case, where the number of deletions is zero, is trivial.
Consider the first delete operation, τ . Let k denote the number of duplicate-delete opera-
tions that precede τ , and let Z be the intermediate string produced by these k operations.
For i = 1, . . . , k, let Si be the subsequence of x that was used in the ith duplicate-delete
operation. By lemma 1, S1 , . . . , Sk form a partition of Z into disjoint, non-overlapping
subsequences of Z. Let d denote the substring of Z to be deleted. Since d is a contiguous
substring, Si ∩ d is a (possibly empty) substring of Si for each i. There are several cases:

    1. Si ∩ d = ∅. In this case we do not change any operation.

    2. Si ∩ d = Si . In this case all characters produced by the ith duplicate-delete operation
        are deleted, so we may omit the ith operation altogether and decrease the number of
        characters deleted by τ . Since Φ(·) is non-decreasing, this does not increase the cost
        of generating Z (and hence y).

    3. Si ∩ d is a prefix (or suffix) of Si . Assume it is a prefix. The case of suffix is similar.
        Instead of deleting the characters Si ∩ d we can avoid generating them in the first
        place. Let r be the smallest index in Si \ d (that is, the first character in Si that is not
        deleted by τ ). We change the ith duplicate-delete operation to start at r and decrease
        the number of characters deleted by τ . Since the affine cost for duplications is non-
        decreasing and Φ(·) is non-decreasing, the cost of generating Z does not increase.
                                                                                                       19

   4. Si ∩d is a non-empty substring of Si that is neither a prefix nor a suffix of Si . We claim
      that this case applies to at most one value of i. This implies that after taking care of
      all the other cases τ only deletes characters in Si . We then change the ith duplicate-
      delete operation to also delete the characters deleted by τ , and omit τ . Since Φ(·)
      obeys the triangle inequality, this will not increase the total cost of deletion. By the
      inductive hypothesis, the rest of y can be generated by just duplicate-delete opera-
      tions with at most the same cost. It remains to prove the claim. Recall that the set
      {Si } is comprised of mutually non-overlapping subsequences of Z. Suppose that
      there exist indices i 6= j such that Si ∩ d is a non-prefix/suffix substring of Si and
      Sj ∩ d is a non-prefix/suffix substring of Sj . There must exist indices of both Si and
      Sj in Z that precede d, are contained in d, and succeed d. Let ip < ic < is be three
      such indices of Si and let jp < jc < js be similar for Sj . It must be the case also
      that jp < ic < js and ip < jc < is . Without loss of generality, suppose ip < jp . It
      follows that (ip , ic ) and (jp , js ) are alternating in Z. So, Si and Sj are overlapping
      which contradicts Lemma 1.

                                                                                                       
To extend the basic recurrence to duplication-deletion distance, we must observe that be-
cause we allow deletions in the string that is duplicated from x, if we assume character xi
is copied to produce y1 , it may not be the case that the character xi+1 also appears in y; the
character xi+1 may have been deleted. Therefore, we minimize over all possible locations
k > i for the next character in the duplicated string that is not deleted. The extension of the
recurrence from the previous section to duplication-deletion distance is:

      ˆ ∅) = 0 , d(x,
      d(x,       ˆ y) = min dˆi (x, y),                                                              (2.4)
                                      {i:xi =y1 }

     dˆi (x, ∅) = 0 ,
                          
                                      ˆ
                           ∆1 + ∆2 + d(x, y2,|y| ),
                          
                          
                                                        (                                    )
     dˆi (x, y) = min                                       ˆ y
                                                            d(x,           ˆ
                                                                 2,j−1 ) + dk (x, yj,|y| )           (2.5)
                           mink>i min{j:yj =xk ,j>1}                                            .
                          
                                                            (k − i)∆2 + Φ(k − i − 1)
                          


Theorem 2. d(x,   ˆ y) is the duplication-deletion distance from x to y. For {i : xi = y1 },
dˆi (x, y) is the duplication-deletion distance from x to y under the additional restriction that
y1 is generated by xi .

The proof of Theorem 2 is an extension to that of Theorem 1. However, the running time
increases; while the number of entries in the dynamic programming table does not change,
                                                                                            20

the time to compute each entry is multiplied by the possible values of k in the recurrence,
which is O(| x |). Therefore, the running time is O(| y |2 | x | µ(x)µ(y)), which is
O(| y |3 | x |2 ) in the worst case.

We now show, in the following lemma, that if both the duplicate and delete cost functions
are the identity function (i.e. one per operation), then the duplication-deletion distance is
equal to duplication distance without deletions.
Lemma 3. Given a source string x, a target string y, If the cost of duplication is 1 per
                                                                              ˆ y) =
duplicate operation, and the cost of deletion is 1 per delete operation, then d(x,
d(x, y).

Proof: First we note that if a target string y can be built from x in d(x, y) duplicate opera-
tions, then the same sequence of duplicate operations is a valid sequence of duplicate and
                                                  ˆ y).
delete operations as well, so d(x, y) is at least d(x,

We claim that every sequence of duplicate and delete operations can be transformed into
a sequence of duplicate operations of the same length. The proof of this claim is similar
to that of Lemma 2. In that proof we showed how to transform a sequence of duplicate
and delete operations into a sequence of duplicate-delete operations of at most the same
cost. We follow the same steps, but transform the sequence into an a sequence that consists
of just duplicate operations without increasing the number of operations. Recall the four
cases in the proof of Lemma 2. In the the first three cases we eliminate the delete operation
without increasing the number of duplicate operations. Therefore we only need to consider
the last case (Si ∩ d is a non-empty substring of Si that is neither a prefix nor a suffix of
Si ). Recall that this case applies to at most one value of i. Deleting Si ∩ d from Si leaves a
prefix and a suffix of Si . We can therefore replace the ith duplicate operation and the delete
operation with two duplicate operations, one generating the appropriate prefix of Si and the
other generating the appropriate suffix of Si . This eliminates the delete operation without
changing the number of operations in the sequence. Therefore, for any string y that results
from a sequence of duplicate and delete operations, we can construct the same string using
only duplicate operations (without deletes) using at most the same number of operations.
                               ˆ y).
So, d(x, y) is no greater than d(x,                                                   

2.4    Extending the Model: Duplication-Inversion Distance

Here we generalize the duplication distance model to allow also for substring inversions.
We now explicitly define characters and strings as having two orientations: forward (+) and
                                                                                             21

inverse (-).
Definition 9. A signed string of length m over an alphabet Σ is an element of ({+, −} ×
Σ)m .

For example, (+b −c −a +d) is a signed string of length 4. An inversion of a signed string
reverses the order of the characters as well as their signs. Formally,
Definition 10. The inverse of a signed string x = x1 . . . xm is a signed string x¯ =
−xm . . . −x1 .

For example, the inverse of (+b −c −a +d) is (−d +a +c −b).

In a duplicate-invert operation a substring is copied from x and inverted before being in-
serted into the target string y. We allow the cost of inversion to be an affine function in the
length ` of the duplicated inverted string, which we denote Θ1 + `Θ2 , where Θ1 , Θ2 ≥ 0.
We still allow for normal duplicate operations.
Definition 11. A duplicate-invert operation from x, δ¯x (s, t, p), copies an inverted substring
−xt , −xt−1 . . . , −xs of the source string x and pastes it into a target string at position p.
Specifically, if x = x1 . . . xm and Z= z1 . . . zn , then Z◦ δ¯x (s, t, p) =
z1 . . . zp−1 x¯t x¯t−1 . . . x¯s zp . . . zn .

The cost associated with each duplicate-invert operation is Θ1 + (t − s + 1)Θ2 .
Definition 12. The duplication-inversion distance from a source string x to a target string
y is the cost of a minimum sequence of duplicate and duplicate-invert operations from x, in
any order, that generates y.

The recurrence for duplication distance (Eqs. 2.1, 2.3) can be extended to compute the
duplication-inversion distance. This is done by introducing a term for inverted duplications
whose form is very similar to that of the term for regular duplication (Eq. 2.3). Specifically,
when considering the possible characters to generate y1 , we consider characters in x that
match either y1 or its inverse, −y 1 . In the former case, then, we use d¯+ (x, y) to denote
                                                                             i
the duplication-inversion distance with the additional restriction that y1 is generated by xi
without an inversion. The recurrence for d¯+ is the same as for di in Eq. 2.3. In the latter
                                                  i
case, we consider an inverted duplicate in which y1 is generated by −xi . This is denoted
by d¯− , which follows a similar recurrence. In this recurrence, since an inversion occurs, xi
      i
is the last character of the duplicated string, rather than the first one. Therefore, the next
character in x to be used in this operation is −xi−1 rather than xi+1 . The recurrence for d¯−i
also differs in the cost term, where we use the affine cost of the duplicate-invert operation.
                                                                                                                 22

The extension of the recurrence to duplication-inversion distance is therefore:


                                                                                                      
       ¯ ∅) = 0 , d(x,
       d(x,       ¯ y) = min                        min         d¯+                           d¯−
                                                                  i (x, y),      min            i (x, y)   ,
                                                  {i:xi =y1 }                 {i:xi =−y 1 }

      d¯+
        i (x, ∅) = 0 ,  d¯− (x, ∅) = 0,
                        i
                        ∆1 + ∆2 + d(x, ¯ y ),
                                           2,|y|
       ¯+
      di (x, y) = min                      n                                                  o
                        min{j:y =x ,j>1} d(x,   ¯   y       ) +  ¯
                                                                  d +
                                                                        (x, y       ) +  ∆ 2    ,
                                  j  i+1              2,j−1         i+1       j,|y|
                       
                        Θ1 + Θ2 + d(x, ¯ y ),
                                           2,|y|
       ¯−
      di (x, y) = min                         n                                                o
                        min{j:y =−x ,j>1} d(x,    ¯   y       ) +  d¯−
                                                                          (x, y       ) + Θ  2    .
                                  j    i−1               2,j−1        i−1       j,|y|

                                                                                                               (2.6)

Theorem 3. d(x,  ¯ y) is the duplication-inversion distance from x to y. For {i : xi = y1 },
d¯+ (x, y) is the duplication-inversion distance from x to y under the additional restriction
 i
that y1 is generated by xi . For {i : xi = −y 1 }, d¯−
                                                     i (x, y) is the duplication-inversion
distance from x to y under the additional restriction that y1 is generated by −xi .

The correctness proof is very similar to that of Theorem 1, only requiring an additional
case for handling the case of a duplicate invert operation which is symmetric to the case of
regular duplication. The asymptotic running time of the corresponding dynamic program-
ming algorithm is O(| y |2 µ(x)µ(y)). The analysis is identical to the one in section ??.
The fact that we now consider either a duplicate or a duplicate-invert operation does not
change the asymptotic running time.

2.5      Extending the Model: Duplication-Inversion-Deletion Distance

Here we extend the distance measure to include delete operations as well as duplicate and
duplicate-invert operations. Note that we only handle deletions after inversions of the same
substring. The order of operations might be important, at least in terms of costs. The cost of
inverting (+a +b +c) and then deleting −b may be different than the cost of first deleting
+b from (+a +b +c) and then inverting (+a +c).
Definition 13. The duplication-inversion-deletion distance from a source string x to a tar-
get string y is the cost of a minimum sequence of duplicate and duplicate-invert operations
from x and deletion operations, in any order, that generates y.
Definition 14. A duplicate-invert-delete operation from x,
η¯x (i1 , j1 , i2 , j2 , . . . , ik , jk , p), for i1 ≤ j1 < i2 ≤ j2 < · · · < ik ≤ jk pastes the string
                                                                                                             23

−xjk −xjk −1 . . . −xik −xjk−1 −xjk−1 −1 . . . −xik−1 . . . . . . −xj1 −xj1 −1 . . . −xi1
into a target string at position p. Specifically, if x = x1 . . . xm and Z= z1 . . . zn , then
Z◦ η¯x (i1 , j1 , i2 , j2 , . . . , ik , jk , p) = z1 . . . zp−1 −xjk −xjk −1 . . . −xik −xjk−1 −xjk−1 −1 . . .
−xik−1 . . . . . . −xj1 −xji −1 . . . −xi1 zp . . . zn .

The cost of such an operation is Θ1 + (jk − i1 + 1)Θ2 +                        Φ(i`+1 − j` − 1). Similar to
                                                                      Pk−1
                                                                         `=1
the previous section, it suffices to consider just duplicate-invert-delete and duplicate-delete
operations, rather than duplicate, duplicate-invert and delete operations.
Lemma 4. If Φ(·) is non-decreasing and obeys the triangle inequality and if the cost of
inversion is an affine non-decreasing function as defined above, then the cost of a minimum
sequence of duplicate, duplicate-invert and delete operations that generates a target string
y from a source string x is equal to the cost of a minimum sequence of duplicate-delete and
duplicate-invert-delete operations that generates y from x.

The proof of the lemma is essentially the same as that of Lemma 2. Note that in that proof
we did not require all duplicate operations to be from the same string x. Therefore, the
arguments in that proof apply to our case, where we can regard some of the duplicates from
x and some from the inverse of x.

The recurrence for duplication-inversion-deletion distance is obtained by combining the re-
currences for duplication-deletion (Eq. 2.5) and for duplication-inversion distance (Eq. 2.6).
We use separate terms for duplicate-delete operations (dˆ¯+i ) and for duplicate-invert-delete
operations (d¯ˆ− ). Those terms differ from the terms in Eq. 2.6 in the same way Eq. 2.5
                 i
differs from Eq. 2.2; Because of the possible deletion we do not know that xi+1 (xi−1 ) is
the next duplicated character. Instead we minimize over all characters later (earlier) than
xi .

The recurrence for duplication-inversion-deletion distance is therefore:
                                                                        
     ˆ
     ¯ ∅) = 0 , d(x,
    d(x,                ˆ
                        ¯ y) = min               ˆ                   ˆ
                                          min d¯i (x, y), min d¯i (x, y) ,
                                                  +                   −
                                                  {i:xi =y1 }           {i:xi =−y 1 }

       dˆ¯+
          i (x, ∅) = 0 ,  dˆ¯− (x, ∅) = 0,
                          i
                                            ˆ¯
                          ∆1 + ∆2 + d(x, y2,|y| ), (
                         
                         
       dˆ¯+                                           ˆ¯ y              ˆ¯+ (x, y )
                                                                                         )
          i (x, y) = min                              d(x,   2,j−1  ) + d k      j,|y|
                          mink>i min{j:yj =xk ,j>1} (k − i)∆ + Φ(k − i − 1) ,
                         
                         
                                                                  2
                         
                                           ˆ¯
                          Θ1 + Θ2 + d(x, y2,|y| ), (
                         
                         
       dˆ¯−                                             ˆ¯ y               ˆ¯−
                                                                                           )
          i (x, y) = min                                d(x,   2,j−1 ) + dk (x, yj,|y| )
                          mink<i min{j:yj =−xk ,j>1} +(i − k)Θ + Φ(i − k − 1) .
                         
                         
                                                                                 2
                                                                                             24


Theorem 4. d(x,  ˆ¯ y) is the duplication-inversion-deletion distance from x to y. For {i : x =
                                                                                             i
       ˆ
y1 }, d¯i (x, y) is the duplication-inversion-deletion distance from x to y under the additional
        +

restriction that y is generated by x . For {i : x = −y }, d¯ˆ− (x, y) is the duplication-
                  1                    i             i        1    i
inversion-deletion distance from x to y under the additional restriction that y1 is generated
by −xi .

The proof, again, is very similar to the proofs in the previous sections. The running time
of the corresponding dynamic programming algorithm is the same (asymptotically) as that
of duplication-deletion distance. It is O(| y |2 | x | µ(y)µ(x)), where the multiplicity µ(y)
(or µ(x)) is the number of times a character appears in the string y (or x), regardless of its
sign.

In comparing the models of the previous section and the current one, we note that restricting
the model of rearrangement to allow only duplicate and duplicate-invert operations instead
of duplicate-invert-delete operations may be desirable from a biological perspective be-
cause each duplicate and duplicate-invert requires only three breakpoints in the genome,
whereas a duplicate-invert-delete operation can be significantly more complicated, requir-
ing more breakpoints.
ANALYSIS OF HUMAN SEGMENTAL
                         DUPLICATIONS

The duplication distance model we presented in Chapter 2 is inspired by biological models
for the emergence of segmental duplications (or low-copy repeats) in mammalian genomes.
Approximately 5% of the human genome consists of segmental duplications > 1 kb in
length with ≥ 90% sequence identity between copies[6]. Segmental duplications account
for a significant fraction of the differences between humans and other primate genomes, and
are enriched for genes that are differentially expressed between the species [15]. Human
segmental duplications contain novel fusion genes[57], genes under strong positive selec-
tion [21], and new gene families[40]. Moreover, the presence of segmental duplications
appears to render regions of the genome more susceptible to recurrent and disease-causing
rearrangements[42] as well as additional copy-number variants [6] and inversions [61].

Reconstructing the evolutionary history of these genomic regions is a non-trivial, but im-
portant task as segmental duplications harbor recent primate-specific and human-specific
innovations [31]. Moreover, since segmental duplications arise as copy-number variants
that become fixed in a population, the evolutionary history of segmental duplications re-
veals information about the mechanisms and temporal dynamics of copy-number variants
in the human genome [38].

The availability of genome sequences from multiple mammalian genomes has led to pro-
posals to reconstruct the genome sequence of the mammalian ancestor [13]. Segmental
duplications remain an extreme challenge for evolutionary reconstruction, as they are the
“most structurally complex and dynamic regions of the human genome” [2].

Human segmental duplications are frequently found within complicated mosaics (duplication
blocks) of duplicated fragments (duplicons) that bear sequence similarity to non-homologous


                                            25
                                                                                         26

regions on multiple human chromosomes [6, 7, 6]. A two-step model of segmental dupli-
cation (reviewed in [6]) has been proposed to explain these mosaic patterns in pericen-
tromeric regions. In the two-step model, duplicons from disparate regions of the genome
(possibly different chromosomes) are first copied and aggregated in a seeding event. Then
in a second phase of pericentromeric transfer, contiguous sequences of duplicons are trans-
ferred en bloc by duplication to non-homologous pericentromeric regions. The result is
that “the pericentromeric region consists of many juxtaposed duplicons that originate from
diverse ancestral regions,” [6]. By contrast, duplication blocks in subtelomeric regions
are thought to arise from the process of double-stranded breakage and repair, resulting in
interchromosomal translocations of contiguous subtelomeric regions. Finally, interstitial
regions gain duplication blocks as a result of multiple rounds of serial duplication. These
three proposed mechanisms for the creation of duplication blocks in the human genome in-
dicates the complexity of these regions. As a result, the convoluted nature of overlapping,
interleaved duplicated material in the genome makes segmental duplications refractory to
traditional sequence analysis.

Jiang et al. [32] recently produced a comprehensive annotation of this mosaic organiza-
tion: they derived an “alphabet” of approximately 11,000 duplicons, and identified 437
duplication blocks, or “strings” containing at least 10 (and typically dozens) of different
duplicons. They also examined the ancestral relationships between human segmental du-
plications, and identified “clades” of segmental duplications that share an abundance of
repeated subsequences. However, their approach ignored the order and orientation of these
repeated subsequences within the segmental duplications, and thus did not explicitly ex-
plain the mosaic organization of segmental duplications. The relationships between these
annotated duplication blocks are complex (Fig. 3.1) and straightforward analysis does not
immediately reveal the ancestral relationships between blocks.

Numerous authors have considered the problem of analyzing relationships between genome
sequences that contain duplicated segments. This work falls into roughly two categories.
The first focus is the problem of computing genome rearrangement distances, like reversal
distance, in the presence of duplicated genes or synteny blocks (see [54, 43, 24], for exam-
ple). However, such rearrangement distances do not model the creation of new duplicates
and thus are not well-suited to describe the evolutionary history of segmental duplications
in the genome. The second focus is to analyze regions with duplications under “local” op-
erations like tandem duplications (see [18, 39], for example). While tandem duplication is
undoubtedly important in the generation of duplication blocks, there is strong evidence that
                                                                                         27


Figure 3.1: A graph of relationships between a subset of 357 duplication blocks in the human
genome. Each vertex is a duplication block, with edges joining blocks whose longest common
subsequence includes at least 3 duplicons.
                                                                                          28

an important characteristic of the history of segmental duplications is the frequent duplica-
tion and transposition of long segments over large physical distances; as many as 50%-60%
of segmental duplications were transposed interchromosomally [6]. Several general models
of rearrangement that allowed for both local operations and duplication-transposition-like
operations between different strings were studied by [27], but the generality of those mod-
els meant that the distances were NP-hard to compute and only approximation algorithms
were given.

In this chapter we consider the problem of constructing evolutionary histories that can ac-
count for the emergence of the duplication blocks we observe in the present-day human
genome. In Section 3.1, we formulate the problem as an integer linear program, inspired
by the two-step model of segmental duplication, with an objective function that minimizes
the total number of duplicate operations needed to construct all the present-day duplication
blocks in a two duplication phases. We represent the resulting evolutionary relationships
between duplication blocks as a tree with height two. We use the duplication distance al-
gorithm presented in Chapter 2 to find the optimal tree solution. Then in Section 3.2, we
generalize the optimization problem formulation to allow for the construction of an evo-
lutionary history directed acyclic graph (DAG). We define the optimization problem with
respect to two criteria: a parsimony criterion that again uses duplication distance as a mea-
sure of parsimony and a likelihood criterion that uses a probabilistic model of duplication
based on a partition function of the ensemble of all possible duplication scenarios. In both
sections, we apply our methods to the analysis of segmental duplications in the human
genome using the set of duplication blocks and constituent duplicons annotated in [32].

3.1   The Most-Parsimonious Two-Step Duplication Tree

As described above, duplication blocks, or segments of the present-day genome that con-
tain duplicated material, contain complex mosaic patterns of smaller segments, known as
duplicons, that appear in multiplicity across the genome. We model both the ancestral
and present-day genomes as signed strings on an alphabet of duplicons. We assume the
present-day genome, which has incurred segmental duplications, is a superstring of the an-
cestral genome, and the duplication blocks are substrings of the present-day genome. See
Figure 3.2.

Recall that according to the the two-step model of duplication, duplicons are copied from
their ancestral loci and aggregated into larger, contiguous segments or seed duplication
                                                                                             29

                       ancestral genome


                      present-day genome


                                                    duplication blocks


Figure 3.2: The present-day genome is a superstring of the ancestral genome. The duplicated
material comprises duplication blocks which are maximal contiguous substrings of the present-day
genome that were not part of the ancestral genome.


blocks during the first duplication phase. In the second phase, substrings of both the seed
blocks and the ancestral genome are copied and then reinserted into the genome at dis-
parate locations, creating secondary duplication blocks. We say that a seed duplication
block seeds a secondary duplication block if substrings of the seed block are used in the
construction of the secondary block in the second phase. See Figure 3.3(a). Note that a
seed block may seed multiple secondary duplication blocks but there may also exist some
seed blocks that do not seed any secondary blocks.

Here we build a duplication scenario that is consistent with a rather literal interpretation
of the two-step model of duplication and that minimizes the total number of duplication
operations needed to construct a given set of duplication blocks in two phases – first by
constructing a set of seed blocks by aggregating duplicons from the ancestral genome and
then by constructing a set of secondary blocks by copying substrings of the seed blocks as
well as singleton duplicons from their ancestral loci. We formulate this problem as an in-
teger linear program that is equivalent to the facility location problem, a classic problem in
operations research. The formulation requires a measure of the minimum number of dupli-
cate operations needed to build a target duplication block from a source duplication block;
we use the duplication distance measure described in Chaper 2. We apply our method to
duplication blocks derived in [32] and discover a two-step duplication scenario in which
64 seed duplication blocks are first constructed and then duplicated to create secondary
duplication blocks.

Note that we make four simplifying assumptions about the two-step model of duplication:

   1. The ancestral genome contains exactly one copy of every duplicon.

   2. No other type of rearrangement operations – such as inversions or deletions – occur.
                                                                                               30


                  ancestral genome

                       primary
                    duplications     {
              seed duplication blocks
                     secondary
                   duplications      {
            secondary duplication blocks

                                               (a)                          (b)


Figure 3.3: (a) The two-step model of duplication. Solid arrows indicate duplicons copied during
the first phase of duplication. Dashed arrows indicate duplicons copied during the second phase of
duplication. (b) The corresponding two-step duplication tree.


   3. The seed blocks are a subset of the duplication blocks observed in the present-day
      genome.

   4. Each secondary duplication block is seeded by exactly one seed duplication block.

Under these assumptions, we can describe a duplication scenario in which a set of duplica-
tion blocks are constructed in two phases by representing it as a two-step duplication tree
on the set of duplication blocks.
Definition 15. Given ancestral and present-day genomes, a two-step duplication tree is
a tree of height three where the root is the ancestral genome and the descendants are the
duplication blocks. Nodes at depth one (i.e. the children of the root) are the seed blocks
created in the first phase of duplication, while nodes at depth two (i.e. children of seed
blocks) are the secondary duplication blocks constructed from substrings of one seed block
and of the ancestral genome. (See Figure 3.3b.)

For a given pair of ancestral and present-day genomes, a most-parsimonious two-step dupli-
cation tree is that which defines a partition of the duplication blocks into seed duplication
blocks and secondary duplication blocks and defines the ancestral relationships between
seed and secondary blocks such that the total number of duplication events needed to con-
struct first the seed blocks and then the secondary blocks is minimum.

The total duplication distance for a two-step duplication tree is the sum of the number of
duplicate operations needed to build all the duplication blocks. We express the number
of duplicate operations needed to build a seed block Bi from the ancestral genome G as
d(G, Bi ). Secondary duplication blocks are built from substrings of both its parent seed
block and the ancestral genome. Thus, we express the number of duplicate operations
needed to build a secondary block Bj from its parent seed block Bi and G as d(Bi ◦ G, Bj ),
                                                                                                      31

where Bi ◦ G denotes the concatenation1 of the strings Bi and G.

We now have the following definition.
Definition 16. Given the ancestral genome G and a set duplication blocks B1 , . . . , BN
from the present-day genome, a most-parsimonious two-step duplication tree is a two-step
duplication tree (Def. 15) with minimum total duplication distance on its edges.

We note that the definition of a most-parsimonious two-step duplication tree can be ex-
tended to more general distance measures. For example, a suitable measure of parsimony
could be duplication-inversion distance or any of the other extensions of duplication dis-
tance presented in Chapter 2.

Now we show how to formulate the problem of constructing a most-parsimonious two-step
duplication tree as an integer linear program (ILP).

A two-step duplication tree for a given ancestral genome and a set of duplication blocks
is defined by a labeling of each of the N duplication blocks as either seed blocks or as
secondary blocks. In addition to this labeling, we must also define for each secondary
block which seed duplication block seeded it, i.e. which seed block is its parent in the tree.

A most-parsimonious two-step duplication tree is a solution of the following integer linear
program.
                        "   N                          N X
                                                         N
                                                                                      #
                            X                          X
                  min             (ui × d(G, Bi )) +             (vij × d(Bj ◦ G, Bi ))            (3.1)
                  U,V
                            i=1                        i=1 j=1


such that
                                           X
                                                 vij = 1 for all i                                 (3.2)
                                             j

                                          vij − uj ≤ 0 for all i, j                                (3.3)
                                       ui ∈ {0, 1} and vij ∈ {0, 1}.                               (3.4)

The binary variables U = [u1 , . . . , uN ] and binary matrix V = [vij ]N
                                                                        i,j=1 describe the topol-
ogy of the duplication tree. The binary variable ui indicates whether a duplication block
Bi is labeled as a seed block and thus defines an edge in the tree from the root G to Bi . The
binary variable vij indicates that secondary duplication block Bi is seeded by seed block
   1
     We insert a “dummy character” between Bi and G in the concatenate to avoid copying substrings across
the boundary.
                                                                                                           32

Bj and thus block Bi is a child of Bj in the tree. Again, we note that the duplication dis-
tance function, d, in the above program could be substituted by any other suitable distance
function between strings with duplications, such as duplication-inversion distance.

We note that this program is equivalent to a special case of the facility location problem,
a classic NP-hard combinatorial optimization problem. The input to the facility location
problem is a set of customers and a set of potential facility sites. For each site, there is a cost
associated with opening a facility, and for each site-customer pair, there is a cost associated
with supplying that customer from a facility at that site. The objective is to minimize the
total cost of opening facilities and supplying customers such that every customer is supplied
by exactly one open facility. In the context of the two-step duplication tree, each duplication
block is both a customer to be supplied and the site of a potential facility. Opening a facility
at site Bi corresponds to classifying Bi as a seed duplication block and the cost of opening
a facility corresponds to the cost of constructing that seed block by aggregating singleton
duplicons from their ancestral loci. Supplying customer Bj from facility Bi corresponds to
classifying Bj as a secondary block that is constructed from substrings of seed block Bi and
the ancestral genome G, and the cost of supplying Bj from Bi is equal to the duplication
distance from Bj to Bi .

We implemented our two-step duplication tree method to analyze the ancestry of segmental
duplications in the human genome using duplication-inversion distance as the measure of
parsimony. We used data from [32] who identified 417 contiguous duplication blocks in
the human genome (hg17, May 2004). The duplication blocks were comprised of mosaic
patterns of a total of 4,692 distinct duplicon sequences. [32] delimited regions of homology
for each duplicon, respectively, within the set of duplication blocks with some duplication
blocks containing tens of thousands of duplicons. Then the authors partitioned the dupli-
cation blocks into 24 “clades” or groups that they believed to have been derived from a
common seed block ancestor. The clade analysis done by [32] was based on a hierarchi-
cal clustering of the duplication blocks by comparing their respective duplicon contents
without regard to the order or orientation of subsequences of duplicons within blocks.

To begin our analysis, we represented each of the 417 duplication block as a signed string
on the alphabet of integers between −4692 and +4692.2 We represented the ancestral
genome G, containing a unique copy of each of the non-homologous ancestral duplicons,
   2
     A total of 437 duplication blocks were identified in the study by [32] but 20 of these blocks were missing
their duplicon annotations.
                                                                                                          33

as the non-ambiguous string of all duplicons (with positive orientations) with dummy char-
acters inserted in between every pair of characters, i.e. G = +1  +2  · · ·  +4692, where
 denotes a dummy character.

For each ordered pair of duplication blocks Bi , Bj , we computed the duplication-inversion
         ¯ i ◦ G, Bj ) using the algorithm presented in Chapter 2 (Eq. 2.6). Note that,
distance d(B
                                              ¯ Bi ) from G to that block, is equal to the
for every duplication block Bi , the distance d(G,
length of the target block | Bi |. With these distances, we solved the ILP in Eq. 3.1 using
the optimization package CPLEX. Given a solution to the ILP, for every index i such that
the binary variable ui = 1 (i.e. such that facility i is opened), we labeled the block Bi as a
seed block. We labeled the remaining blocks as secondary blocks. For any pair of blocks
Bi , Bj , if the binary variable vij = 1 in the solution, we designated secondary block Bi
to be a child of seed block Bj (i.e. customer i is supplied from facility j). The resulting
two-step duplication tree is shown in Figure 3.4.3

The two-step duplication tree for the 417 human duplication blocks exhibits 64 seed blocks
with varying numbers of secondary blocks, respectively, ranging from 1 to 28. We com-
pared our analysis to the clade analysis of [32]. Note that the clade analysis represents
a partition of the duplication blocks into groups that are believed to have evolved from a
common seeding “ancestor” block. Similarly, our two-step duplication tree defines groups
of blocks that might have evolved from a common ancestor block, namely the groups of
blocks defined by a subtree rooted at a given seed block. Furthermore, our two-step dupli-
cation tree defines putative ancestral relationships between duplication blocks, indicating
which duplication blocks may have seeded others.

After computing our tree, we colored the nodes of the tree according to the clade parti-
tion computed by [32] in a post-process. A visual inspection of the two-step duplication
tree reveals an interesting relationship between our analysis and that of [32]: many of the
subtrees rooted at a seed block, called seed block groups, are monochromatic or nearly
so. (The largest seed block groups can be viewed in Fig. 3.5.) This concordance between
our analysis and that of [32] indicates that we discovered many of the same relationships
between groups of duplication blocks. However, many of our seed block groups were not
monochromatic. To quantify the agreeance of the two analyses, we computed a χ2 test of
independence for the set of seed block groups we derived and the set of clades derived by
   3
     A previous version of this result appeared in [37]. Here we present a more recent, previously unpublished
result. The difference between the previously published tree and that shown here owes to a revised annotation
of the set of duplication blocks identified by [32].
                                                                                            34


                                                                           Clade


Figure 3.4: Most-parsimonious two-step duplication tree. The large, red node in the center is
the root and represents the ancestral genome. The 64 children of the root are seed duplication
blocks that were constructed during the first “seeding” phase of duplication. Children of seed
blocks were constructed during the second phase of duplication. For comparison, the duplication
blocks are colored according to the clade annotations computed in [32]. (Blocks labeled with
clade ‘s’ are members of small clades with five or fewer members.) The seed blocks are:
chr9:94148625-94280597, chr2:86972987-87534573, chr2:95487038-95584611, chr13:51949404-
52115934, chr2:131224724-131311234, chr17:21390703-21507201, chr6:58245619-58614492,
chrY:7321346-7583561, chr20:23590986-23799725, chr19:142690-301857, chr9:41693541-
41917754, chr13:22383730-22442581, chr18:14925636-15381131, chr2:111008489-111108421,
chr9:67819771-68015022, chr7:22302593-22353864, chr15:21883445-22374842, chr7:64400020-
64811464, chr21:32722334-32748407, chr22:21973667-22071991, chr17:16500312-16755448,
chr2:106442349-106590011,          chr22:23318989-23413657,        chr2:131763427-131992854,
chr6:5001-107014, chr19:59919900-60071479, chr7:45538339-45670718, chr15:30232700-
30686935, chr15:28156284-28697532, chr1:13071685-13302468, chr15:28722450-28924396,
chr15:75938308-76080230, chr18:14115138-14897871, chrY:2951337-3780270, chr16:21261363-
21477613,      chr13:92057461-92135789,        chr12:36142962-36276062,       chr5:174269510-
174290127,       chr2:97093305-97342494,        chr7:63160249-63206018,       chr17:23079959-
23123088, chr3:126882689-127197788, chr17:4963474-5027382, chr2:94748046-95035488,
chr5:49692886-49892733, chr15:26189051-26378746, chr1:561232-873944, chr2:97484061-
97718125,      chr16:18074902-18712195,        chr15:76779895-76886440,       chr21:13291342-
14363850,       chr7:56385628-56445155,        chr1:145513740-145734052,       chr7:56461815-
56554511, chr21:28138407-28357448, chrY:12885909-13066979, chr7:43774170-43854733,
chr22:15694297-15767503,          chr1:142607398-142875550,         chr20:23905387-23924629,
chr7:55505324-55618047, chr7:63313785-63361829, chr7:57180803-57309010, chr1:142299774-
142408068.
                                                                                               35


                                                                            Clade


Figure 3.5: Most-parsimonious two-step duplication tree showing only subtrees of size at least 12.
The large, red node in the center is the root and represents the ancestral genome. The 6 children
of the root are seed duplication blocks that were constructed during the first “seeding” phase of
duplication. Children of seed blocks were constructed during the second phase of duplication.
For comparison, the duplication blocks are colored according to the clade annotations computed in
[32]. The seed blocks are: chr2:106442349-106590011, chr9:41693541-41917754, chr12:8206636-
8492606, chr7:22302593-22353864, chr15:26189051-26378746, chr18:14925636-15381131.


[32]. For groups containing at least six members (of which 34 seed block groups qualified
and all 24 clades qualified), the probability that the correlation between the two categories
was due to chance was P < 0.35.

Without strong evidence by which to conclude that the analysis of [32] corroborates ours,
we sought to refine our analysis. Admittedly, our two-step duplication tree formulation
interprets the two-step model of segmental duplication rather literally. It is unlikely that
all the duplication blocks that exist in the human genome today were constructed in ex-
actly two phases of duplication; it seems more plausible that perhaps several rounds of
duplication took place with secondary blocks seeding tertiary blocks and tertiary blocks
seeding quaternary blocks, etc. Moreover, if the model of duplication block construction
via multiple rounds of duplication is plausible, then it seems reasonable to assume that any
particular duplication block might have been constructed from multiple duplication events
in which the duplicated material originated from more than one external seed block. That
                                                                                          36

is, the tree structure of our two-step duplication tree solution may be overly restrictive as
some duplication blocks might have been seeded from multiple “parent” blocks.

In the next section, we refine our analysis of human duplication blocks by reformulating the
problem of constructing a duplication history as the problem of computing an optimal di-
rected acyclic graph (DAG) on the set of duplication blocks according to either a parsimony
or a likelihood criterion.

3.2    The Max Parsimony and Max Likelihood Duplication History DAGs

Here we present a novel formulation of the problem of computing an evolutionary history
for a set of segmental duplications that are organized in duplication blocks. We represent
evolutionary relationships between a set of duplication blocks as a directed acyclic graph
(DAG), relaxing some of the constraints of the problem formulation given in the previous
section. We formalize the evolutionary reconstruction problem as an optimization over the
space of DAGs.

We present two different methods for scoring a DAG: one based on parsimony and one
based on likelihood. The parsimony score for a DAG is a straightforward extension of du-
plication distance that describes the most-parsimonious sequence of duplicate operations
needed to construct a given target string. Because we have presented it in Chapter 2, here
we forgo a description of the duplication distance measure or algorithm for computing it.
The likelihood score for a DAG is the product of the likelihood scores for each of the dupli-
cation blocks, where a duplication block’s likelihood is derived by computing the weighted
ensemble of all possible duplication scenarios that could have generated it. We describe
how to compute the partition function of the ensemble efficiently using a dynamic pro-
gram that generalizes the duplication distance (i.e. parsimony score) recurrence. Deriving
a probabilistic model from a dynamic program this way is analogous to the approach of
[44] who applied dynamic programming to RNA folding to compute the partition function
of all secondary structures and to assign probabilities to certain substructures..

Finally, we solve these evolutionary reconstruction problems on the set of duplication
blocks identified by [32] using a local search technique based on simulated annealing.
We compare these reconstructions to the analysis of [32]. Our evolutionary reconstruction
recapitulates some of the properties of earlier analysis but also reveals additional and more
subtle relationships between segmental duplications.
                                                                                                     37

                             X=                                      abcde
                            Y0 =                                            ∅
                            Y 1 = Y 0 ◦ δX (1, 3, 1) =                   abc
                            Y 2 = Y 1 ◦ δX (4, 5, 1) =               deabc
                             Y = Y 2 ◦ δX (4, 5, 5) = deabdec

Figure 3.6: An example of a sequence of duplicate operations that constructs Y = deabdec
from X = abcde. The corresponding feasible generator is: ΨX = (X4,5 , X1,3 , X4,5 ) =
((de), (abc), (de)).


3.2.1    The Partition Function

We begin our discussion of the likelihood-based optimization problem with some prelimi-
naries. Recall from Chapter 2 the following. Given a source/target pair X, Y , any sequence
of duplicate operations of the form δX (s1 , t1 , p1 ), . . . , δX (sd , td , pd ) that generates Y from
X uniquely partitions the characters of Y into non-overlapping subsequences correspond-
ing to characters that were copied conjointly from X.
Definition 17. Given a source string X, a generator ΨX = (Xi1 ,j1 , . . . , Xik ,jk ) is a se-
quence of substrings of X.
Definition 18. A generator ΨX = (Xi1 ,j1 , . . . , Xik ,jk ) is feasible for a target string Y , that
we denote as ΨX a Y , if:

   1. The elements of ΨX partition the characters of Y into mutually non-overlapping
        subsequences {S1 , . . . , Sk }.
   2. There exists a bijective mapping f : {Xi,j ∈ ΨX } → {S1 , . . . , Sk } from substrings
        of X to subsequences in Y corresponding to how the elements of ΨX partition Y .
   3. The order of elements in ΨX corresponds to the order of the leftmost characters of
        the subsequences f (Xi1 ,j1 ), . . . , f (Xik ,jk ) in Y .

See Fig. 3.6.

A sequence of k duplicate operations that constructs Y from X uniquely defines a feasible
generator ΨX with length k whose elements correspond, respectively, to substrings of X
that are duplicated conjointly in a single operation.

While a parsimony assumption is attractive from a theoretical perspective and can produce
useful biological insight, it might be overly restrictive, particularly when there are many
                                                                                          38

different optimal or nearly optimal solutions. Consider, for example, the strings X =
abcdef ghijkl and Y = agdbhecif dajebkfclg. The duplication distance, d(X, Y ), is 13
and there is a single feasible generator with this optimum length. However, there are 989
possible feasible generators for Y , 119 of which have length 14, just slightly suboptimal.
Because the space of all possible feasible generators is very large, a probabilistic model
might give very low probability to an optimal parsimony solution. Thus, here we present
a probabilistic model of segmental duplication that considers the weighted ensemble of all
feasible generators for a source/target string pair.

For a given source string X and positive integer k we consider the space of all length-
k generators ΨX . We define a probability distribution on the collection of generators by
defining P r[ΨX ] ∝ ω(ΨX ) where ω(ΨX ) is the “score,” or weight, assigned to a generator,
                                          (k)
and we compute the partition function ZX of the weighted ensemble of all possible length-
k generators ΨX . Given a source string X and a target string Y , we define the event F to be
the event of choosing a length-k generator that is feasible for Y from the space of length-k
generators. We define a probabilistic model for segmental duplications that, given a target
string Y , assigns a probability to F : P r[F | Y, X, k]. For a fixed target string Y , the
probability, P r[F | Y, X, k], is the weighted ensemble of all possible length-k generators
                                                                            (k)
that are feasible for Y , normalized by the partition function ZX . In particular, we can
express the probability as:


                                                1          X
                         P r[F | Y, X, k] =     (k)
                                                                        ω(ΨX ),         (3.5)
                                              ZX      ΨX a Y :|ΨX |=k


where | ΨX | denotes the length of the generator. The likelihood of a target string Y , then
can be expressed as L(Y | F, X, k) = P r(F | Y, X, k).

The score of a generator, ω(ΨX ), can be defined according to various biological models.
Although different functions ω may require different algorithms for computing the value
P r[F | Y, X, k], we found that functions of the form ω(ΨX ) = σ(| ΨX |, l(ΨX )) where
l(ΨX ) = Xi,j ∈ΨX | Xi,j | denotes the sum of the lengths of the elements of ΨX , admit
         P

particularly efficient algorithms for computing Eq. (3.5). We discuss the score function
further in Sec. 3.2.2.
                                                                           (k)
Now we give an algorithm to compute the partition function, ZX . Given a score function
of the form σ(| ΨX |, l(ΨX )), each length-k generator whose elements have lengths that
                                                                                  (k)
sum to l are scored the same, namely σ(k, l). Therefore, in order to compute ZX , we must
                                                                                                        39

calculate the total number of length-k generators whose lengths sum to l for all relevant
                        (k)
values of l. Let CX (l) equal the number of distinct length-k generators for which the sum
of the lengths of the elements equals l.4
Lemma 5. Let X = x1 . . . x|X| be a source string and let k and l be positive integers. The
            (k)
function CX (l) satisfies the following recurrence.

                                (1)
                              CX (l) = | X | −l + 1,
                                         l−1
                                         X
                               (k)              (k−1)
                              CX (l) =        CX (l0 ) · (| X | −(l − l0 ) + 1).
                                          l0 =l−|X|

                                           (1)
Proof: (Sketch) The base case, CX (l), counts the number of distinct substrings of X with
length l. In the case that k = 1, the number of substrings of X that have length l is equal
                                            (k)
to X − l + 1 . Then, the value CX (l) can be computed recursively by summing over all
the ways of adding a k th element to a set of k − 1 substrings such that the resulting set of
k elements has total length l. For a set of k − 1 substrings of X whose lengths sum to l0 ,
there are X − (l − l0 ) + 1 substrings of X that could be added to the set to yield a set of k
substrings whose lengths sum to l.
                                                                              (k)                 (k)
For a source string X and integers k, l, if we are given CX (l), we can compute ZX ef-
                                    (k)
ficiently by summing CX (l) over all relevant lengths l, weighting each feasible generator
appropriately according to the function σ(k, l). This gives the following theorem.
Theorem 5. Let X = x1 . . . x|X| be a source string and k be a positive integer. The
                              (k)
partition function ZX satisfies the following.
                                                      |X|·k
                                           (k)
                                                      X        (k)
                                          ZX      =           CX (l) · σ(k, l).
                                                      l=k

Note that the elements of a length-k list of substrings of X can have lengths that sum to at
least k and at most | X | ·k.
                                                                                    (k)
The recurrence in Lemma 5 can be computed in O(| X | k) time, so ZX can be computed
in O(| X |2 k 2 ) time according to Theorem 5.
                  (k)
   4
     The value CX (l) is related to the well-known integer partition function p(n) and corresponding Young
                                                                                                  (k)
tableaux.    If P(l, k) is the set of partitions of the integer l into k parts, we can express CX (l) =
                p∈P (| X | −p + 1) · k!.
P            P
   P ∈P(l,k)
                                                                                             40


3.2.2    The Score Function

We define the score of a generator ω(ΨX ) to be some function that reflects the biological
plausibility of the event of choosing a particular generator ΨX from the space of all gener-
ators and then duplicating the substrings of ΨX in some duplication scenario. When infer-
ring a sequence of duplicate operations that can account for the construction of a particular
target string Y by copying substrings of a particular source string X, a reasonable assump-
tion is that the “simplest” explanation is the best. We consider the most-parsimonious
duplication scenario–that is, the one requiring the fewest number of duplicate operations–
to be the simplest. As noted above, the most-parsimonious solution can be computed using
the duplication distance algorithm presented in [36, 37]. Therefore, our first considera-
tion for scoring a generator is that one with small length, e.g. with length equal to the
duplication distance, ought to have a good score.

Given a source string X, and two different target strings Y1 and Y2 , where | Y1 |<| Y2 |,
we assume that if the character contents of Y1 and Y2 are similar, then the construction
of Y1 from X is more likely than the construction of Y2 from X. Again, this assumption
favors simplicity. Therefore, two generators with length k that are feasible for Y1 and Y2 ,
respectively, should be scored in a way that the generator for Y1 is preferable to that for Y2 .

Theorems 2.7 and 2.9 allow for a score function of the form σ(| ΨX |, l(ΨX )). However,
we impose two additional conditions that are biologically plausible.

   a. For integers k1 < k2 , σ(k1 , l(ΨX )) > σ(k2 , l(ΨX )). This property matches our
        intuition that a feasible generator with lesser cardinality (corresponding to a shorter
        sequence of duplicate operations needed to construct the target string) be more likely
        than a feasible generator with higher cardinality.

   b. For identical source and target strings, X = Y , of length | X |=| Y |> 1, σ(1, |
      Y |) > ΨX :|ΨX |=k σ(k, | Y |) for any k > 1. This will ensure that the event F of
              P

        choosing a length-k generator for any k > 1 is less probable than choosing a length-1
        generator; i.e. P r[F | Y, X, k] < P r[F | Y, X, 1]. This matches our intuition that
        the unique feasible generator of length 1 corresponding to the construction of Y by
        simply duplicating all of X in a single operation, will have higher probability than
        the combination of all feasible generators of length k > 1. Note that when X 6= Y ,
        this property also ensures an analogous preference of feasible generators containing
        any long, contiguous substring Xs,t that appears as a substring of Y over feasible
                                                                                                         41

        generators that contain fragmented portions of Xs,t to generate the same substring in
        Y , all other elements being equal.

A suitable score function that meets these criteria is:


                                                                  1
                                              σ(k, | Y |) =           .                               (3.6)
                                                               | Y |k

Undoubtedly, there are other biologically motivated score functions that may produce mean-
ingful results.

3.2.3     Restricted Partition Function

In this section, we present the final ingredient necessary to compute the probability P r[F |
                                                                          (k)
Y, X, k], namely the sum in Eq. (3.5) that we define as QX (Y ). We refer to the value
  (k)
QX (Y ) as the restricted partition function of feasible generators, and it is equal to the
                                                                                                (k)
weighted ensemble of all length-k generators ΨX that are feasible for Y . Hence QX (Y ) =
                            ΨX aY :|ΨX |=k σ(k, | Y |).
P                         P
  ΨX aY :|ΨX |=k ω(ΨX ) =

In order to compute this value, we generalize the recurrence presented in Chapter 2 for
computing duplication distance from source string X to target string Y to count the number
of length-k generators that are feasible for Y .
Lemma 6. Given a source string X = x1 . . . x|X| and a target string Y = y1 , . . . , y|Y | ,
                  (k)
the number NX (Y ) of distinct length-k generators ΨX that are feasible for Y satisfies the
following recurrence.

         (k)
                              X        (k)
        NX (Y ) =                    NX (Y, i),
                          i:xi =y1
                          
                          1         if Y = Xi,i+|Y |−1 ,
        (1)
    NX (Y, i) =
                          0         otherwise,
                                                     X        k
                                                              X
     (k)                   (k−1)                                  (l)            (k−l)
    NX (Y, i)       =     NX (Y2,|Y | )       +                 [NX (Y2,j−1 ) · NX (Yj,|Y | , i + 1)].
                                                  j>1:yj =xi+1 l=1


                        (k)
Here, the term NX (Y, i) represents the number of feasible generators ΨX with length k
given that the character y1 is generated by a substring of X starting at xi .

First, we give intuition for the recurrence in Lemma 6, and then we sketch a proof of its
correctness. We note that the proof of correctness for Lemma 6 mirrors, in many ways,
                                                                                           42

the proof of correctness of Theorem 1 that gives a recurrence for computing a minimum-
length feasible generator for a source string X and a target string Y ; here, instead, we want
to count the total number of feasible generators ΨX that have a fixed length k.

Recall the non-overlapping property stated as Lemma 1. The recurrence for computing
  (k)
NX (Y ) is efficient because the non-overlapping property allows us to subdivide the char-
acters of the target string Y into independent subproblems. For example, if we are consid-
ering the set of feasible generators that contain some subsequence S of Y , SS = {ΨX : S ∈
ΨX }, then for every ΨX ∈ SS , all other elements of ΨX cannot overlap the characters in S.
Therefore, the substrings of Y in between successive characters of S define subproblems
that can be computed independently.

Note that every character of Y must appear at least once in X. In order to count the number
of feasible generators ΨX with length k > 1, we must consider all subsequences of Y that
could have been generated by a single duplicate operation and the number of ways we could
combine exactly k of those subsequences to form a feasible generator ΨX . The recurrence
is based on the observation that in any feasible generator, ΨX , y1 must be the first (i.e.
leftmost) character in some element of ΨX . There are then two cases to consider: either
(1) y1 was the last (or rightmost) character in the substring that was duplicated from X to
generate y1 , or (2) y1 was not the last character in the substring that was duplicated from X
to generate y1 .

Proof: (sketch)
                                              (k)        (k)
The recurrence defines two quantities: NX (Y ) and NX (Y, i). We shall show, by induc-
                                                                       (k)
tion, on | Y | and k that for a pair of strings, X and Y , the value NX (Y ) is equal to the
                                                          (k)
number of length-k feasible generators ΨX , and that NX (Y, i) is equal to the number of
length-k feasible generators ΨX under the restriction that the character y1 is copied from
                                        (k)
index i in X, i.e. xi generates y1 . NX (Y ) is computed by summing over all characters xi
of X that can generate y1 ..
                                                                                   (k)
As described above, we must consider two possibilities in order to compute NX (Y ). In
every feasible generator ΨX , the character y1 must appear in some subsequence Sy1 ∈ ΨX
of Y that contains y1 as a leftmost character and that corresponds to a substring of X that
was copied conjointly to produce the subsequence Sy1 . Either:

    • Case 1: y1 was the last (or rightmost) character in the substring of X that was copied
        to produce y1 , i.e. Sy1 has length 1, or
                                                                                               43

   • Case 2: xi+1 is also copied in the same duplicate operation as xi , possibly along with
      other characters as well, i.e. Sy1 has length greater than 1.

For case one, number of length-k feasible generators ΨX is equal to the number of length-
(k − 1) feasible generators ΨX (Y2,|Y | ) for source string X and target string Y2,|Y | (the
suffix of Y ); the union of the subsequence corresponding to the single character y1 and
any length-(k − 1) feasible generator ΨX (Y2,|Y | ) results in a length-k feasible generator
ΨX . For case two, Lemma 1 implies that the total number of length-k feasible generators
ΨX is the product of two independent subproblems. Specifically, for each j > 1 such that
xi+1 = yj and for each l ∈ {1, 2, . . . , k}, we compute: (i) number of length-l feasible
                                                                                 (l)
generators for source string X and target string Y2,j−1 , namely NX (Y2,j−1 ), and (ii) the
number of length-(k − l) feasible generators for source string X and target string y1 Yj,|Y |
that include an element Sy1 in which y1 is generated by xi . To compute the latter, recall that
all relevant feasible generators (corresponding to case 2 above) ΨX must contain an element
that corresponds to a duplicate operation in which xi and xi+1 are copied conjointly. The
number of relevant length-(k − l) feasible generators for source string X and target string
y1 Yj,|Y | that contain an element Sy1 that corresponds to a substring of X starting at xi and
also containing xi+1 is equal to the number of relevant length-(k − l) feasible generators
for source string X and target string Yj,|Y | that contain some element Syj that corresponds
                                                    (k−1)
to a substring of X starting at xi+1 , namely NX            (Yj,|Y | , i + 1).                 
                                                     (k)
We compute the restricted partition function QX (Y ) efficiently by first counting the num-
                                                    (k)
ber of relevant feasible generators, namely NX (Y ), and scoring each generator appropri-
ately by σ(k, | Y |). This gives us the following theorem.
Theorem 6. Let X = x1 . . . x|X| , Y = y1 , . . . , y|Y | be a source/target string pair and let k
                                                                  (k)
be a positive integer. The restricted partition function QX (Y ) satisfies the following.

                                 (k)          (k)
                               QX (Y ) = NX (Y ) · σ(k, | Y |).


The recurrence given in Lemma 6 can be computed in time O(| Y |2 k 2 µ(Y )µ(X)) where
µ(Y ) (resp. µ(X)) is the maximum multiplicity of any character that appears in Y (resp.
                      (k)
X), so computing QX (Y ) takes the same time.

The probabilistic model of duplication based on the ensemble of feasible generators pre-
sented in this chapter is just one of many possible models one could imagine. Ultimately,
a model ought to reflect a reasonable approximation of a biologically plausible events, but
                                                                                            44

should also admit an efficiently computable algorithm. We give another generative proba-
bilistic model of duplication in Appendix B.

3.2.4   Problem Formulation

Here we formalize the problem of computing a segmental duplication evolutionary history
for a set of duplication blocks in the human genome with respect to either a parsimony or
likelihood criterion.

The input to our problem is the set of duplication blocks found in the human genome, each
represented as a signed string on the alphabet of duplicons. Our goal is to compute a puta-
tive duplication history that accounts for the construction of all of the duplication blocks.
We assume that the ancestral genome is devoid of segmental duplications. A duplication
history is a sequence of duplicate events that first builds up a set of seed duplication blocks
by duplicating and aggregating duplicons from their ancestral loci and then successively
constructs the remaining duplication blocks by duplicating substrings of previously con-
structed blocks.

We observed in [36] strong evidence that many of the duplication blocks identified by [32]
had been constructed through the duplication and aggregation of substrings of duplicons
from several other blocks. Therefore, a tree cannot aptly represent an evolutionary history;
a more appropriate representation of the evolutionary relationships between duplication
blocks is a DAG in which the vertices represent duplication blocks and an edge directed
from a vertex X to a vertex Y indicates that substrings of X were duplicated in the construc-
tion of Y . A vertex with multiple incoming edges and, therefore, multiple parents, is con-
structed using substrings of all of the parent blocks. Specifically, given a DAG G = (D, E),
for Y ∈ D, we define PG (Y ), the parent string of Y , by PG (Y ) = X1  X2  · · ·  Xp
where Xi ∈ {D | (Xi , Y ) ∈ E} and  indicates the concatenation of two strings with a
dummy character inserted in between.

We make two simplifying assumptions. First, we assume that only duplicate events occur
and that there are no deletions, inversions, or other types of rearrangements within a dupli-
cation block. Second, we assume that a duplication block is not copied and used to make
another duplication block until after it has been fully constructed, ensuring the evolutionary
relationships cannot contain cycles. We acknowledge that our two simplifying assumptions
restrict the evolutionary history reconstruction problem significantly, but admit an efficient
and consistent method of scoring a solution. Similar assumptions were made, for example,
                                                                                           45

by [52] to derive the evolutionary tree for Alu repeat elements.

We can define the optimal DAG with respect to a parsimony criterion using duplication
distance (see Ch. 2).
Definition 19. Given a set of duplication blocks D, the maximum parsimony evolutionary
                                                        P
history is the DAG G = (D, E) that minimizes f (G) = Y ∈D d(PG (Y ), Y ).

We can also define the optimal DAG with respect to a likelihood criterion. In phylogenetic
tree reconstruction, a max likelihood solution is a tree that maximizes the probability of
generating the characters at the leaf nodes over all possible tree topologies, branch lengths,
and assignments of ancestral states to the internal nodes. Typically, the evolutionary pro-
cess is assumed to be a Markov process so that the probabilities along different branches
are independent. We similarly define the maximum likelihood DAG using the probabilistic
model derived in Section ??. We maximize the likelihood of the solution over all DAG
topologies and–instead of branch lengths–the numbers of operations permitted to construct
each node.
Definition 20. Given a set of duplication blocks D, the maximum likelihood evolutionary
history is the DAG G = (D, E) that maximizes the likelihood:
                                                    Q
                          L(G)                  =       Y ∈D   L(Y ),
                                              (maxk P r[F | Y, PG (Y ), k]) ,
                                     Q
                                 =       Y ∈D
                                                                           
                                     Q                 (k)            (k)
                                 =       Y ∈D   maxk Q PG (Y ) (Y )/ZPG (Y ) ,

         (k)              (k)
where ZPG (X) and QPG (Y ) are the partition function and restricted partition functions, re-
spectively.

3.2.5   Results

We analyzed a set of 391 duplication blocks identified by [32] that were represented as
signed strings on an alphabet of ≈ 5, 000 duplicons. We computed the maximum par-
simony evolutionary history (Def. 19) for the entire set of blocks (see Fig. 3.7). The
DAG exhibited multiple connected components. For comparison, we then computed the
maximum likelihood evolutionary histories (Def. 20) for several of the subgraphs induced
by connected components of the parsimony solution. We scored generators according to
σ(k, | Y |) =     1
                |Y |k
                      .

We used a simulated annealing strategy to find a maximum parsimony DAG for the entire
                                                                                                     Clade


Figure 3.7: The maximum parsimony DAG for a set of 391 duplication blocks in the human genome. Edges indicate evolutionary relations; an
edge is directed from a node u to a node v if the most-parsimonious duplication scenario includes duplication events that copy substrings of u in
the construction of v. [32] partitioned the duplication blocks into a set of 24 clades (plus one ‘s’ group of duplication blocks found in subtelomeric
regions) that we indicate here with 25 colors on nodes. The 3 sets of colored edges represent inheritance networks for 3 conserved subsequences of
duplicons. These inheritance networks are almost entirely confined to a single clade each. The green edges represent the inheritance of the duplicon
sequence [6968, 6967, 6965, 6963, 6962, 6960] in clade ‘M1’, the red edges represent the inheritance of [7039, 7036, 7037] in clade ‘M2’, and the
                                                                                                                                                         46


blue edges represent the inheritance of [9448, 9449] in clade ‘chr16.’
                                                                                                  47


Figure 3.8: A connected component of the maximum parsimony DAG. Nodes from clade ‘M1’ are
red and nodes from clade ‘chr7 2’ are green. Node labels correspond to duplication block IDs. The
blue edges represents the inheritance network for non-core duplicon 6970.


                                                         DAGs for several subgraphs.5
set of duplication blocks and to find maximum likelihood Clade
For each input, we ran our local search 300 times. We started the search an equal number
of times at each of three different types of initial graphs: (a) the empty graph with no edges;
(b) the directed minimum spanning tree (MST); and (c) a randomly chosen DAG (chosen
independently for each trial). Finally, to focus the search on the most important block rela-
tionships, we considered only edges between blocks whose longest common subsequence
(LCS) contained at least 20 duplicons. We describe the simulated annealing heuristic in
more detail in Section 3.2.8.

3.2.6   Maximum Parsimony Reconstruction

The maximum parsimony DAG contains 391 nodes and 479 edges. There are 9 connected
components with at least 4 duplication blocks, and nearly 40% of the blocks appear in the
largest connected component. Figure 3.8 shows a moderately-sized connected component.
The graph also contains a total of 105 singleton nodes for which we did not infer any
evolutionary relations with other duplication blocks, 97 of which did not exhibit an LCS of
length 20 with any other block.

The maximum parsimony DAG represents a scenario in which all 391 duplication blocks
   5
    Both the max parsimony and max likelihood versions of the problem can be shown to be NP-hard by a
reduction from the problem of Learning Bayesian Networks.
                                                                                                 48


Figure 3.9: An example of duplication block recombination. The target duplication block in the
middle (chr9:65.9-66.5) exhibits subsequences that appear as contiguous substrings in four other
seeding duplication blocks. The nested relationships between subsequences in the target block (e.g.
the green subsequence is nested inside the purple one and the red subsequence is nested inside
the blue one) allow us to conclude the target block was composed of duplicated substrings from the
other four blocks and not vice versa. Moreover, the nested relationships between these subsequences
imply an order of duplication events (i.e. the green subsequence was duplicated after the purple one
and the red subsequence was duplicated after the blue one).


could have been constructed in a sequence of 17,431 total duplicate operations. As a base-
line comparison, a minimum spanning tree, with respect to duplication distance, on the set
of duplication blocks has a total parsimony score of 28,852 and, by definition, contains 390
edges.

A striking feature of the max parsimony DAG was the occurrence of duplication block
recombination, or the creation of a single target block by duplicating and aggregating sub-
strings from multiple parent blocks. See Fig. 3.9 for an example of a duplication block
that exhibits a subsequence composition that can best be described as the result of seed-
ing events involving multiple parents. The existence of a duplication block that contains
subsequences contributed by multiple parent blocks was not possible in the two-step du-
plication tree formulation presented in Section 3.1, underscoring the differences between
the two approaches. In total, 128 duplication blocks exhibited multiple parents. Of those,
105 exhibited at least two parents that contributed, respectively, at least 10% of the dupli-
con content of the target node (computed as the number of duplicons in the subsequences
contributed by a parent divided by the total number of duplicons in the target). Similarly,
66 blocks exhibited at least two parents that contributed, respectively, at least 20% of their
constituent duplicons. And 52 exhibited at least two parents that contributed, respectively,
at least 25% of the content of the target block.
                                                                                           49


                                      (a)            (b)

Figure 3.10: (a) Component comprised entirely of duplication blocks from clade ‘chr16’ in the
maximum parsimony DAG. (b) Maximum likelihood DAG for subgraph induced on nodes in (a).


[32] performed an initial analysis of the duplication blocks. They defined 24 clades, or
groups of duplication blocks derived from a common ancestor block, by performing hierar-
chical clustering on a matrix representing the shared presence or absence of duplicons for
every pair of blocks. For a given clade they defined a core duplicon as one that appears in at
least 67% of the constituent duplication blocks. They posited that clades represent families
of evolutionarily related duplication blocks and that core duplicons “may have driven the
evolution of the duplication blocks” in a clade.

After constructing the max parsimony DAG, we colored the nodes in a post-process ac-
cording to the clades described in [32]. We found a strong correspondence between Jiang
et al.’s clades and connected subgraphs in our DAG; 5 of the 9 connected components with
at least 4 blocks were comprised of duplication blocks belonging to a single clade and 7
of the 9 components were comprised of blocks belonging to no more than 2 clades. For
example, see Fig. 3.10(a) and 3.11(a). In larger components, nodes from a single clade
frequently induce a connected subgraph. For example, see Fig. 3.8. We performed a χ2
test of independence between the members of the clades defined by [32] and the members
of connected components in our graph. We restricted the test to clade and components with
at least 18 members. We found that there was a strong relation between the partition of our
graph into connected components and the clade analysis done by [32] (P < 0.0001). The
analysis done by [32], therefore, corroborates our own conclusions.

Our DAG also reveals which duplication blocks may have seeded many other blocks (i.e.
those with high out-degree). For example, in Fig. 3.8, block 399 exhibits eight children and
is an inflection point for the component. Moreover, the edge from block 399 to 405 links
                                                                                          50


                                    (a)            (b)
                                    (a)             (b)


Figure 3.11: (a) Component comprised entirely of duplication blocks from clade ‘chr10’ in the
maximum parsimony DAG. (b) Maximum likelihood DAG for subgraph induced on nodes in (a).


blocks from the the ‘M1’ and ‘chr7 2’ clades. Even though the blocks 399 and 405 belong
to different clades, 405 is very “close” to 399 in duplication distance: block 405 contains
only 71 duplicons, but it shares a subsequence of 29 duplicons with block 399. This link
suggests that the entirety of clade ‘chr7 2’ was descended from clade ‘M1’ in an optimal
duplication history.

Also implicit in the DAG is information about which duplicons are duplicated from one
block to another in an optimal duplication history. We define the inheritance network
for each duplicon as the subgraph induced on the edges on which that duplicon is passed
from parent to child. The average size of an inheritance network was 5.5 edges with a
standard deviation of 10.4. As expected, the 81 core duplicons identified by [32] were
more promiscuous, on average, than non-core duplicons with a mean inheritance network
size of 21. Interestingly, a comparison of the inheritance networks for core and non-core
duplicons revealed that many non-core duplicons exhibit larger inheritance networks within
subgraphs induced by a clade than many of the core duplicons. For example, non-core
duplicon 6970 appeared on 36 of the 63 total edges in the subgraph induced by clade ‘M1’
(shown in blue in Fig. 3.8) and does not appear on any other edge in the graph. We propose
6970 as a new core duplicon for this clade and suggest that others like it should also be
categorized as core duplicons.

Moreover, we found inheritance networks for many conserved subsequences of duplicons
that were nearly as prominent as those for individual core duplicons. For example, the
subsequence [6968, 6967, 6925, 6963, 6962] of duplicons appears on 23 of the edges in the
                                                                                         51

subgraph induced by ‘M1’ clade nodes (shown as green edges in Fig. 3.7). Similarly, the
sequence [7039, 7036, 7037] exhibits a connected inheritance network of 7 edges within
the subgraph induced on clade ‘M2,’ and [9448, 9449] exhibits an inheritance network of
7 edges within the subgraph induced on clade ‘chr16’ that includes an inheritance path of
length 5 (shown also in Fig. 3.7). By delineating the inheritance networks of duplicon
subsequences that are conserved across duplication blocks, we can learn about which du-
plicons were duplicated and transposed conjointly. This type of analysis was impossible
using only the clade annotations of [32].

3.2.7   Maximum Likelihood Reconstruction

We computed the maximum likelihood DAGs (Def. 20) for the sets of duplication blocks
appearing within moderately-sized connected components of the maximum parsimony DAG
in order to compare the two methods. We chose the components comprised of blocks from
clades ‘chr16’ and ’chr10’, respectively (in Fig. 3.7). The maximum likelihood subgraphs
for these subproblems are shown in Fig. 3.10(b) and 3.11(b).

The two DAGs for the ‘chr16’ component in Fig. 3.10 share some characteristics. For
example, node 121 is a common ancestor of every other block and block 276 exhibits
high out-degree in both solutions. Both solutions are similarly “good” with respect to the
parsimony objective: the solution in (a) exhibits an optimal parsimony score of 397, and the
one in (b) exhibits a score of 401. However, the likelihood score for the parsimony solution
in (a) was nearly zero. One difference that accounts for this discrepancy is the higher
average in-degree for blocks in the parsimony solution (2.2) as compared to the likelihood
solution (1.3). Also, the parsimony solution exhibits a path with ten edges, whereas the
longest path in the likelihood solution has six.

Some of these differences are due to the fact that the parsimony criterion does not penalize
edges that do not directly improve the score. For example, block 291 has two parents
(276 and 25) in the parsimony DAG but only one parent (276) in the likelihood DAG.
However, the duplication distance with source 276  25 and target 291 is the same as
the duplication distance with source 276 and target 291. Therefore, the edge from 25 to
291 does not improve the parsimony score, underscoring that there are multiple optimal
parsimony solutions. In contrast, the likelihood of a target block generally increases as
the sum of the lengths of its parent blocks decreases, so the max likelihood DAG will not
include edges that do not directly improve the score.
                                                                                         52


3.2.8   The Simulated Annealing Heuristic

[29] describes an elegant approach for moving locally in the space of DAGs via three types
of simple moves- adding a new edge, removing an existing one, or reversing an existing
one.
Definition 21. Given DAG G = (V, E), the DAG G0 = (V, E 0 ) neighbors G if and only if
we can obtain G0 from G with a single move- adding a new edge, removing, or reversing
an existing edge.
Definition 22. Given an objective function f and two DAGs G1 , G2 we call ∆G = f (G1 )−
f (G2 ) the difference in their energies.

The simulated annealing strategy can be summarized as follows. Given a DAG G = (V, E)
and a randomly proposed move, we examine whether the move is legal (i.e. does not
induce a cycle) and, if so, we perform the move with probability p = exp( −∆G
                                                                           T
                                                                              ), where
T is a temperature parameter. We note that depending on the complexity of the objective
function f (G) computing ∆G could be very expensive. In fact, this is the case for the
max likelihood reconstruction because computing P r[Y | X, k] takes in the worst-case
O(| Y |3 | X | k 2 ). Therefore, we employ a hashtable to store the cost of every move we
have examined. As we do hundreds of independent trials we may often need to examine
the same move multiple times, and the hashtable helps significantly speed up the search for
good moves..

In our implementation we employ an exponential cooling schedule schedule. The temper-
ature is updated via the equation Tt+1 = Tt α. We determined empirically that α = 0.98
performs best in terms of efficiency and time.

The simulated annealing heuristic often terminated in local optima. For a particular in-
stance, the solutions found by all 300 trials would include many globally suboptimal so-
lutions. However, many of the locally optimal solutions encountered were “close” to the
score for the best solution found. For example, the search for the max parsimony evolu-
tionary history given in Fig. 3.10(a) resulted in a component whose objective score is 397;
more than 1/6 of the total trials returned solutions whose objective scores are no more than
407 and well over 1/2 of the total trials returned solutions whose objective scores are no
more than 437 (see Fig. 3.12).
                                                                                                                                  53


                                                              60


                                                              50


                           Num local optimal returned by SA
                                                              40


                                                              30


                                                              20


                                                              10


                                                              0
                                                              380   400   420   440   460    480    500   520   540   560   580
                                                                                       Parsimony score


Figure 3.12: Results of 300 trials of simulated annealing (SA) heuristic: number of local optima
returned by SA vs. objective scores. Results are from search for max parsimony evolutionary history
for component comprised of duplication blocks from clade ‘chr16’ whose global optimum is given
in Fig.3(a).


3.3    Conclusion and Future Directions

We have given several methods for constructing a putative history of human segmental du-
plications. First, we defined the problem of computing a most-parsimonious duplication
history for segmental duplications that is consistent with the so-called two-step model of
segmental duplication. We computed an optimal putative duplication history for a set of
human segmental duplications using the duplication distance algorithm described in Chap-
ter 2 and an integer linear program problem formulation. We then refined the segmental
duplication history reconstruction problem by relaxing the constraint that the history must
be consistent with the two-step model, and instead be consistent more generally with a
multi-step model. We defined a probabilistic model of duplication and gave an efficient al-
gorithm for computing the probability of a duplication scenario for a pair of signed strings
by computing the partition function of the weighted ensemble of feasible sets of duplicated
segments. Then we defined the problem of constructing an optimal segmental duplication
history DAG with respect to both parsimony and likelihood criteria. Finally, we computed
a near-optimal putative duplication history DAG for the same set of human segmental du-
plications with respect to the parsimony criterion and a near-optimal putative duplication
history DAG for a subset of the same human segmental duplications with respect to the
likelihood criterion using.
                                                                                       54

Our maximum parsimony and maximum likelihood DAG reconstructions show some dif-
ferences, both from each other and from the analysis of [32]. In particular, we identify
non-core duplicons and subsequences that are arguably as promiscuous within a clade as
core duplicons.

There are several directions for future work. From a theoretical perspective, one can in-
corporate other types of operations into the probabilistic model, such as deletions and
inversions (which we described in the parsimony setting in Chapter 2), as well as single
nucleotide mutations. Also, our method could be used to sample over the space of DAGs
using a Markov Chain Monte Carlo (MCMC) strategy. From the perspective of applica-
tions, a more comprehensive analysis of genes or other elements in the newly identified
core duplicons and core subsequences from our reconstruction is warranted, as is a further
refinement of the clade annotation by analyzing the clade-induced subgraphs of the DAGs.
         COMPLETING A PARTIALLY
        ASSEMBLED GENOME USING
       REARRANGEMENT DISTANCE

The growing abundance of cancer genome sequence data (see [51], for example) under-
scores the need for efficient and reliable algorithms for whole-genome assembly. However,
assembling a genome accurately is still a costly and labor-intensive process. The need to
survey many tumor samples means that the sequencing and finishing efforts devoted to a
single genome are likely to be relatively small so we need computational methods to aid
the process.

Unfortunately, in order to keep the cost of sequencing a single cancer genome down, the
sequence coverage may be low. The effect is that a cancer genome may not be able to be
reconstructed fully from the sequence data. Instead, the reconstruction may consist of a
fragmented genome comprised of multiple DNA contigs whose sequences are known but
whose relative ordering cannot be inferred directly, and some regions may not have been
measured at all. While this fragmented representation of a cancer genome can be used,
for example, to identify regions of amplification or deletion, we cannot fully appreciate the
effects of large-scale rearrangements without full cancer genome reconstructions.

In some cases, we have a good idea of what the full genome should look like. Unlike when
sequencing a new species, in assembling a cancer genome, we know the architecture of the
“starting genome,” i.e. the human genome prior to somatic mutation. We assume, conser-
vatively, that a human cancer genome will exhibit some mutations, but will, by and large,
resemble a healthy human genome. Given a partially assembled cancer genome comprised
of multiple fragmented contigs, we can reconcile it with a reference (healthy) genome to
construct a full cancer genome that contains all the assembled contigs as subsequences but
that is otherwise as similar to the reference genome as possible. In particular, where regions
                                             55
                                                                                           56

may not have been measured in the cancer sequence data, we assume that the those regions
resemble the reference genome. Moreover, whenever there is ambiguity in how to arrange
the fragmented cancer genome contigs, we err on the side of conservatism and use the refer-
ence genome as our guide. This process, known as resequencing a cancer genome, requires
fewer clones and is less labor-intensive than de novo whole genome fragment assembly.

In this chapter, we consider the Block Ordering Problem (BOP) that was introduced by
Gaul and Blanchette [28]. The input to the BOP is a pair of partially assembled genomes
where all the regions of the genomes are sequenced, but the genomes are fragmented into
contigs (blocks) that need to be ordered and oriented such that the similarity between the
resulting pair of genomes is maximized. In the formulation of [28], the measure of similar-
ity between a pair of genomes is given by the number of cycles in their breakpoint graph
([5]), and they give a linear-time algorithm to solve it.

Here we consider a special case of the BOP where the input is one fully sequenced refer-
ence genome and one partially assembled genome where some regions may not have been
sequenced. This special case is motivated by the problem of resequencing a cancer genome:
given a fully sequenced reference genome and a partially assembled cancer genome com-
prised of multiple contigs possibly with some missing regions, complete the cancer genome
in such a way that it contains all the assembled contigs as subsequences and such that the
similarity between the cancer genome and the reference genome is maximized. We note,
however, that the techniques we present can be extended in a straightforward manner to
solve the general BOP in which both input genomes are only partially assembled.

Unlike [28], here we use the double-cut-and-join (DCJ) distance metric between genomes
as a measure of similarity. First introduced by [59], DCJ distance is an efficiently com-
putable metric that models a number of basic rearrangements, such as inversions, transloca-
tions, fusions, fissions, transpositions, and block interchanges. In [59], the authors present
a linear-time algorithm for computing DCJ distance between a pair of genomes using the
breakpoint graph data structure introduced by [5]. Later, [9] introduced a new data struc-
ture, the adjacency graph, that simplifies the algorithms for computing DCJ distance and
for computing a sorting sequence of DCJ operations.

In this chapter, we show that solving the BOP with respect to DCJ distance is equivalent
to solving it with respect to the number of cycles in the breakpoint graph; in fact, another
contribution of this thesis is a proof that the number of cycles in the breakpoint graph for
                                                                                           57

a pair of genomes is equal to the number of cycles in their adjacency graph using cycle-
preserving graph transformations. Moreover, just as [9] simplified the presentation of the
algorithms for computing DCJ distance using an adjacency graph, here we simplify the
presentation of the algorithm for solving the BOP by using the adjacency graph (instead
of the breakpoint-graph-based framework used by [28]). We differentiate between two
variants of the problem and give linear-time algorithms to solve them both. Finally, [28] do
not give a full proof of correctness for their algorithm; we complete the proof of correctness
for our algorithm.

In the last section of this chapter, we discuss how our adjacency-graph-based framework
for solving the BOP might give insight into a more general problem of constructing a set
of unknown cancer genomes given an ambiguous set of sequence measurements.

4.1      Related Work

As mentioned previously, the work in this chapter is closely related to that of Gaul and
Blanchette [28] who introduced the Block Ordering Problem (BOP). The BOP was in-
spired by DNA sequencing technology that generates whole-genome sequencing data with
relatively low coverage, inevitably resulting in the construction of a fragmented genome
comprised of sequenced contigs whose relative ordering cannot be inferred directly. In par-
ticular, the problem is defined as follows: given two signed permutations (genomes) that
are broken into blocks, order and orient each set of blocks in such a way that the number
of cycles in the breakpoint graph of the resulting permutations is maximized, which they
note “has been shown to approximate very well the [minimum] reversal distance between
them.”

In this work we use a more recently introduced measure of similarity between genomes
as our objective criterion, namely the double-cut-and-join distance. Introduced in [59],
the double-cut-and-join (DCJ) operation cuts a genome in two locations and then fuses
the new ends in a different orientation. The authors present a linear-time algorithm for
computing the DCJ distance between a pair of permutations using the breakpoint graph
framework originally introduced in [5] to compute the reversal distance between a pair of
permutations. [59] shows that the DCJ distance between genomes A and B, dDCJ (A, B),
obeys:
                           dDCJ (A, B) = b(A, B) − c(BGA,B ),                            (4.1)

where b(A, B) is the number of breakpoints and c(BGA,B ) is the number of cycles in the
                                                                                           58

breakpoint graph of A and B. They give a quadratic-time algorithm for computing a sorting
sequence of DCJ operations based on the breakpoint graph of the two genomes. Interest-
ingly, [59] observe that reversals, transpositions, and block interchanges with weights one,
two, and two, respectively, can all be modeled by a single DCJ operation.

The DCJ operation was subsequently revisited by Bergeron et al. [9] who present a new
framework for computing DCJ distance without using the breakpoint graph of [5]. Instead,
in [9], the authors present a linear-time algorithm for computing the distance between two
signed permutations with respect to the DCJ metric based on an adjacency graph construc-
tion. In particular, they show the following.
Theorem 7 (Bergeron et al., Thm 1). Given genomes A and B on the same set of N genes,
the DCJ distance is:
                                                                    
                                                     i(AGA,B )
                       dDCJ (A, B) = N − c(AGA,B ) +                     ,               (4.2)
                                                        2

where N is the number of genes, c(AGA,B ) is the number of cycles in the adjacency graph
and i(AGA,B ) is the number of paths with an odd number of edges in the adjacency graph.


They also give a linear-time algorithm for computing a sorting sequence of DCJ operations.

The connection between reversals and DCJ operations when applied to a single circular
chromosome was also described by [9]. In particular, a single DCJ operation when applied
to a circular chromosome can result in a reversal or in a cycle fission (in which the circular
chromosome becomes two circular chromosomes). Similarly, a DCJ operation on a pair of
circular chromosomes can result in a cycle fusion (in which the 2 chromosomes become
a single circular chromosome). Therefore, even if the start/end genomes are known to
be circular-unichromosomal, a sequence of DCJ operations transforming one genome into
another may create some intermediate genomes that are circular-multichromosomal.

Although the relationship between reversal operations and DCJ operations has been studied
previously, here we make explicit the relationship between the breakpoint graph framework
of [5] and the adjacency graph framework of [9]. In particular, we prove that the number
of cycles in a breakpoint graph for a pair of permutations is equal to the number of cycles
in the corresponding adjacency graph for the same pair. As a consequence, we can show
that the algorithm given by [28] for the BOP yields a pair of genomes whose DCJ distance
is minimum. The solution given in [28], however, is complicated; the algorithm relies on
                                                                                                59

a breakpoint graph framework and requires the construction first of a fragmented break-
point graph (a generalization of the breakpoint graph requiring four colors to distinguish
different types of vertices) and then of a block ordering graph (a vertex-bicolored and edge-
bicolored graph constructed from the fragmented breakpoint graph). Finally, they give an
algorithm to processes each type of component in the block ordering graph in succession.
The algorithms we present below are simpler, relying on an adjacency graph framework.

Furthermore, the proof of correctness for the algorithm presented in [28] relies on some
assumptions that are not proven explicitly. As stated above, the authors solve the BOP by
ensuring the number of cycles in the resulting breakpoint graph is maximized. Given the
input pair of partially assembled genomes, they construct a fragmented breakpoint graph
(a generalization of a breakpoint graph) that exhibits, initially, some number of cycles.
The argument then is that the total number of new cycles constructed by their algorithm
is optimal. Although they quantify the number of new cycles that are constructed by their
algorithm, they do not argue that this number is an upper bound and therefore optimal. (See
Appendix C for additional discussion of the results presented in [28].) Here we prove the
optimality of the algorithms we present.

4.2    Preliminaries

A gene is an oriented sequence of nucleotides, starting from its tail and ending at its head.
The tail and head of a gene are referred to as its extremities. We denote the tail of a gene
a as at and its head as ah . For gene a, the extremities at and ah are obverse extremities,
denoted at = a
             ¯h and ah = a
                         ¯t . When two genes appear consecutively on a chromosome,
two of their extremities form an adjacency. An adjacency is represented as an unordered
set of two extremities. For example, if genes a and b appear consecutively, they may form
any of four possible adjacencies: {ah , bt }, {ah , bh }, {at , bt }, {at , bh }. An extremity that
appears at the end of a linear chromosome and is therefore not adjacent to any other gene
extremity is called a telomere and is represented as a singleton set, such as {at }.

A genome on N genes is a set of adjacencies and telomeres in which each of the 2N
corresponding extremities is a member of exactly one element of the genome. The genome
graph for genome A, denoted GA , is a graph with nodes corresponding to the extremities
in elements of A and edges between pairs of obverse extremities and extremities that are
elements of the same adjacency in A, i.e. the set of edges is {(u, v) | u = v¯ or {u, v} ∈
A}. Note that GA is a graph that contains only nodes with degree one or two and that
                                                                                                      60


                         !"#         !$#         %"#         %$#         &$#          &"#


                        '$#          '"#         ("#         ($#


                         )"#         *$#         *"#         )$#


Figure 4.1: The genome graph for the genome A. There are 14 nodes, representing the extremities
of the 7 genes exhibited in A. Each pair of obverse extremities defines an edge, and each adjacency
in A defines an edge. The genome graph exhibits 3 connected components, corresponding to 3
constituent chromosomes. The top 2 chromosomes are linear and the bottom one is circular.


it may contain cycles. Given a genome A with N extremities, its genome graph can be
constructed in O(N ) time in a straightforward manner.

A chromosome in a genome A is the subset of adjacencies and telomeres containing all the
extremities in a single connected component in GA . The chromosomes of a genome can be
inferred directly from its genome graph. A chromosome is linear if it contains exactly two
telomeres and circular if it contains no telomeres. Note that a genome may contain either
linear or circular chromosomes or a mixture of both.

For example, consider the following genome and its genome graph given in Fig. 4.1.

      A = {{at }, {ah , bt }, {bh , ch }, {ct }, {dh }, {dt , et }, {eh }, {ft , gh }, {gt , fh }}.

A contains seven genes and three chromosomes, two of which are linear and one of which
is circular.

We also define a partial genome on N genes to be a set of adjacencies and telomeres in
which each of the 2N extremities can be a member of no more than one element. Similarly,
the partial genome graph for a partial genome X on N nodes is the graph induced on
the 2N corresponding extremities of the N genes with edges between pairs of obverse
extremities and pairs of extremities that are elements of an adjacency in X . See Figure 4.2
for an example.
Definition 23. A genome is circular-unichromosomal if and only if it contains exactly one
chromosome and no telomeres.
                                                                                                         61


                     !"#               !$#         %$#        %"#


                     &$#               &"#         '"#        '$#             *"#      *$#


Figure 4.2: A partial genome graph for the partial genome X = {{ah , bh }, {ct , dt }, {dh , et }} on
genes a, b, c, d, e. The excluded adjacencies are E(X ) = {{at , bt }, {ch , eh }}.


                  A         !"#$%&'(         !%#$)#'(    !*&$+&'(          !)&$*#'(   !"&$+#'(


                                                                    (%)#

                   B        !"#$%&'(         !%#$*&'(    !)#$+&'(          !)&$"&'(   !*#$+#'(


Figure 4.3: The adjacency graph for genomes A = {{ah , bt }, {bh , ch }, {dt , et }, {ct , dh }, {at , eh }}
and B = {{ah , bt }, {bh , dt }, {ch , et }, {ct , at }, {dh , eh }}. The graph contains 3 cycles.


For a pair of genomes, A and B, that contain the same set of genes, we define the adjacency
graph, denoted AGAB , as the bipartite graph whose vertices correspond to the elements of
A and B. For the pair a ∈ A, b ∈ B, there are | a ∩ b | edges between a and b in the
adjacency graph. Therefore, a node in an adjacency graph may have degree one (for a
telomere) or two (for an adjacency). An adjacency graph can contain only simple paths
and simple cycles. Note that an adjacency graph may contain parallel edges (or a simple
cycle of length two). The adjacency graph for a pair of genomes on N genes can contain at
most N cycles. See Fig. 4.3. Given a pair of genomes on N genes, we can construct their
adjacency graph in O(N ) time in a straightforward manner.

Similarly, we can define a partial adjacency graph for a pair of partial genomes or one
complete genome and one partial genome. See Figure 4.4 for an example. Note that a node
may have degree 0 in a partial adjacency graph.

In [9], the authors proved that the DCJ distance between a pair of genomes defined on the
same set of N genes can be computed by analyzing the adjacency graph. (See Thm 7.)
                                                                                                       62


                  A       !"#$%&'(              !%#$)&'(              !)#$*&'(   !*#$+&'(   !+#$"&'(


                   X                 !"#$%#'(              !)&$*&'(              !*#$+&'(


Figure          4.4:        A       partial       adjacency,"-( graph  for     the    genome   A =
{{ah , bt }, {bh , ct }, {ch , dt }, {dh , et }, {eh , at }}    and   partial      genome    X   =
{{ah , bh }, {ct , dt }, {dh , et }} on genes a, b, c, d, e. The missing extremities are M(X ) =
{at , bt , ch , eh }. The unsatisfied pairs are U(A, X ) = {{bt , ch }, {eh , at }}.


4.3    The Breakpoint Graph and the Adjacency Graph

In the previous sections, we described the adjacency graph for a pair of genomes. Here
we present the breakpoint graph structure introduced by [5]. We only discuss the break-
point graph structure for pairs of circular-unichromosomal genomes, although the structure
can be generalized for linear-unichromosomal genomes as well. First, in [5], the authors
assume that genomes are represented by signed permutations on an alphabet of genes. A
signed permutation π can be transformed into the equivalent set representation of a genome
Gπ described in Section 4.2. First, we transform π into an unsigned permutation π 0 on 2N
markers: for each gene in π with positive orientation, such as +a, replace it in π 0 by the
consecutive pair of extremities at , ah , for each gene in π with negative orientation, such
as −a, replace it in π 0 by the consecutive pair of extremities ah , at . Then, for each pair of
successive extremities i, j such that i and j are not obverse extremities (i.e. i 6= ¯j) in π 0 ,
add the adjacency {i, j} to Gπ . Also add an adjacency corresponding to the first and last
extremities in π 0 .

A breakpoint graph for a pair A, B of circular-unichromosomal genomes on N genes con-
tains 2N nodes – one for each extremity. The breakpoint graph contains three types of
edges. Pairs of obverse extremities are connected with obverse edges. For every adjacency
{i, j} ∈ A, i and j are connected with a black edge. For every adjacency {i0 , j 0 } ∈ B, i0
and j 0 are connected with a green edge. A cycle in the breakpoint graph is a green-black
alternating cycle. See Figure 4.5 for an example.

In this section, we show explicitly the connection between the breakpoint graph and the
                                                                                                                       63


Figure         4.5:               The         breakpoint         graph      for        genomes            A             =
{{ah , bt }, {bh , ch }, {dt , et }, {ct , dh }, {at , eh }} and B = {{ah , bt }, {bh , dt }, {ch , et }, {ct , at }, {dh , eh }}.
Black edges correspond to adjacencies in A, and green edges correspond to adjacencies in B.
Dashed edges represent obverse edges. There are three green-black alternating cycles.


adjacency graph for a pair of genomes: we prove that the number of cycles in the break-
point graph for a pair of unichromosomal genomes is equivalent to the number of cycles
in their adjacency graph. That is, we prove that for unichromosomal genomes A and B,
c(BPA,B ) = c(AGA,B ). To our knowledge, this connection between the two graphs has not
been shown explicitly before.
Lemma 7. Given a pair of circular-unichromosomal genomes, A, B, on N genes, the
number of cycles in the breakpoint graph equals the number of cycles in the adjacency
graph.

Proof: To prove the equivalence of the two graphs, we give a series of cycle-preserving
graph transformation operations that converts the adjacency graph into the breakpoint graph.
Start with the adjacency graph AGA,B . For each adjacency in A, split the node into two –
corresponding to the two extremities – and connect them with a black edge. For each ad-
jacency in B, split the node into two – corresponding to the two extremities – and connect
them with a green edge. In both A and B, connect obverse extremities with obverse edges.
(Note that the set of vertices in A is equal to the set of vertices in B, and the obverse edges
between extremities in A will be the same as the set of obverse edges between extremities
in B.) We then merge all identical pairs of vertices from A and B. We delete one of the two
parallel copies of each obverse edge. The resulting graph is a breakpoint graph for A and
B and there is a one-to-one correspondence between cycles in the original adjacency graph
and cycles in the resulting breakpoint graph.                                                                           

In [30], the authors show that for circular-unichromosomal genomes A and B on N genes,
the reversal distance obeys:

                                    dreversal (A, B) = N − c(BPA,B ) + t,                                           (4.3)
                                                                                          64

where c(BPA,B ) is the number of cycles in the breakpoint graph and t is usually a small
constant that accounts for the number of operations needed to handle the so-called hurdles
and fortresses in the breakpoint graph. By comparison, [9] show that the DCJ distance
obeys:
                              dDCJ (A, B) = N − c(AGA,B ),                              (4.4)

where c(AGA,B ) is the number of cycles in the adjacency graph. Therefore, Lemma 7
implies that the only difference between the reversal distance and the DCJ distance for a
pair of genomes owes to the existence of hurdles or fortresses in the breakpoint graph and
can be quantified by t.

Another consequence of Lemma 7 is that the goal of solving the BOP with respect to the
number of cycles in the breakpoint graph is equivalent to solving the BOP with respect to
the number of cycles in the adjacency graph. Hence, although they state that their goal is to
minimize the approximate reversal distance between the two final genomes, the algorithm
of [28] maximizes the number of cycles in their adjacency graph and thus also minimizes
the DCJ distance between them.

4.4      Problem Formulation

Here, we use the formulation of [9] and an adjacency-graph approach to simplify the result
of [28] to solve the block ordering problem with respect to the DCJ metric. In our discus-
sion, we consider a special case of the problem, that we call the Completion Problem, in
which one of the two input genomes is a fully sequenced reference genome, and the goal
is to complete a partially assembled genome so as to maximize its similarity to the known
reference genome. We describe two variants of the completion problem – one in which the
form of the final assembled genome is unconstrained and one in which the final genome is
required to be comprised of a single, circular chromosome. We note that our algorithms for
the completion problem can be extended to solve the BOP more generally.

We begin with some definitions.
Definition 24. A completion of a partial genome X is a genome Xˆ such that X ⊆ Xˆ .
Definition 25. A circular-unichromosomal completion of a partial genome X is a genome
Xˆ such that X ⊆ Xˆ and Xˆ is circular-unichromosomal.
Definition 26. Given a circular-unichromosomal genome A on N genes and a partial
genome X on N genes containing only adjacencies, the Unrestricted Completion Problem
is to find a completion Xˆ of X such that dDCJ (A, Xˆ ) is minimum.
                                                                                          65

Definition 27. Given a circular-unichromosomal genome A on N genes and a partial
genome X on N genes containing only adjacencies, the Restricted Completion Problem is
to find a circular-unichromosomal completion Xˆ of X such that dDCJ (A, Xˆ ) is minimum.

We note that, as its name suggests, the restricted version of the problem is more constrained
than the unrestricted version.
Claim 1. Given a circular-unichromosomal genome A, an optimal solution XˆU to the un-
restricted completion problem, and an optimal solution XˆR to the restricted completion
problem, dDCJ (A, XˆU ) ≤ dDCJ (A, XˆR ).

Proof: Any circular-unichromosomal completion Xˆ of X is also a completion of X . There-
fore, an optimal solution XˆR to the restricted version of the completion problem is also a
(possibly not optimal) solution to the unrestricted version of the problem. Thus, the dis-
tance dDCJ (A, XˆR ) is an upper bound on the distance dDCJ (A, XˆU ) to an optimal solution
to the unrestricted version of the problem.

4.5      The Algorithm

4.5.1    The Unrestricted Problem

First, we give an algorithm for the unrestricted version of the completion problem. Our
input to the problem is a reference genome A that is circular-unichromosomal and a set
of measured adjacencies X that contain each extremity that appears in an element of A at
most once. The goal is to complete the partial genome X by adding new adjacencies and
telomeres to it in order to build the full genome that is most similar to A with respect to
DCJ distance.

By Thm. 7, any strategy that will complete a partial genome X in such a way that the DCJ
distance between the completion of that genome Xˆ and a given reference genome A will
be minimal will maximize the value of c(AGA,Xˆ ) + i(AGA,Xˆ )/2 in the resulting adjacency
graph.

Let us consider the partial adjacency graph AGA,X for A and X . Recall that it is bipar-
tite. Note that because A is circular-unichromosomal, it contains only adjacencies – no
telomeres. Also note that X is defined to be a set of adjacencies. Moreover, because A is
a complete genome, all the 2N extremities are represented, so all the nodes in X will have
degree 2 in the partial adjacency graph. In other words, the partial adjacency graph must
contain exactly 2 | X | edges. Consequently, the partial adjacency graph contains only
                                                                                                66

cycles of even length and even-length paths whose two endpoints are in A.

There are | X | adjacencies in X , and because each adjacency contains two extremities, that
means that 2 | X | of the 2N total extremities are represented in X . The rest are missing.
Definition 28. Let M(X ) be the set of 2N −2 | X | missing extremities that do not appear
in X .

For example, in Figure 4.4, M(X ) = {at , bt , ch , eh }. These are the extremities that must
be added to X in order to complete it.
Claim 2. Let X be a partial genome on N genes, and let P be any partition of the elements
                    |M(X )|
of M(X ) into         2
                              = N − | X | adjacencies. X ∪ P is a completion of X .

The claim follows from the definition of a completion; the set X ∪ P contains X and
exhibits all 2N extremities.
Claim 3. Given a partial genome X and a perfect matching M on M(X), X ∪ M is a
completion of X .

The claim follows directly from Claim 2; the perfect matching M is a partition of M(X )
       |M(X )|
into     2
                 adjacencies.

Consider again the partial adjacency graph. Note that an extremity m ∈ M(X ) must
appear as a member of a node a ∈ A that has degree zero or one in AGA,X .
Definition 29. Given a circular-unichromosomal genome A, and a partial genome X , let
u, v be extremities in M(X ). {u, v} is an unsatisfied pair provided either:

   1. {u, v} ∈ A, or

   2. there exist u0 , v 0 such that {u, u0 } and {v, v 0 } are degree-one nodes at opposite ends
         of an even path in AGA,X .

See Fig. 4.4 for an example.
Definition 30. Given a circular-unichromosomal genome A, and a partial genome X ,
U(A, X ) is the set of all unsatisfied pairs of extremities.
                                             |M(X )|
Note that | U(A, X ) |= N − | X |=             2
                                                     ;   every extremity in M(X ) is a member of an
unsatisfied pair.
Claim 4. Given a circular-unichromosomal genome A, a partial genome X , and an unsat-
isfied pair {u, v} ∈ U(A, X ), the partial genome X 0 = X ∪ {{u, v}} will exhibit a partial
adjacency graph such that c(AGA,X 0 ) = c(AGA,X ) + 1, where c(AGA,X ) is the number of
cycles in the partial adjacency graph for A and X (and similar for c(AGA,X 0 )).
                                                                                                  67


Figure 4.6: An illustration of Claim 4. The partial adjacency graph from Fig. 4.4 is augmented to
show both of the possible cases for processing an unsatisfied pair. The red node and edges illustrate
the first case – adding an adjacency to X that corresponds to an unsatisfied pair that is also an
adjacency in A, yielding a new cycle of length two. The green node and edges illustrate the second
case – adding an adjacency to X that corresponds to an unsatisfied pair that appear at opposite ends
of an even path, transforming the even path into a new cycle.


Proof: Let AGA,X be the partial adjacency graph for A and X . Because {u, v} is an
unsatisfied pair, there are two possible cases: either (1) {u, v} ∈ A or (2) there exist u0 , v 0
such that {u, u0 } and {v, v 0 } are degree-one nodes at opposite ends of a path with l edges
in AGA,X where l is even. (In the partial adjacency graph in Fig. 4.4, the unsatisfied pair
{eh , at } is an example corresponding to the first case, and the unsatisfied pair {bt , ch } is
an unsatisfied pair corresponding to the second case.) Suppose we add {u, v} as a new
adjacency to X , yielding X 0 . In the first case, the node {u, v} in A will have degree zero in
AGA,X . Therefore, adding {u, v} to X will create a new cycle of length two in the resulting
adjacency graph AGA,X 0 . In the second case, adding {u, v} to X as a new adjacency, will
induce edges between {u, v} in X 0 and {u, u0 } in A and between {u, v} in X 0 and {v, v 0 }
in A, transforming the length-l path between {u, u0 } and {v, v 0 } into a cycle with l + 2
edges in AGA,X 0 . (See Fig. 4.6 for an illustration.)                                            

As noted above, for the unrestricted completion problem, an optimal completion Xˆ of
partial genome X must exhibit a maximum number of cycles and odd paths in the resulting
adjacency graph for the circular-unichromosomal reference genome A and the completion
Xˆ , AG ˆ . Recall that a partial adjacency graph for a circular-unichromosomal genome
        A,X
and a partial genome containing only adjacencies can contain only cycles and paths with
an even number of edges. Also recall that we may complete X by choosing a partition of
the elements of M(X ) into a perfect matching of size N − | X | and then adding those
                                                                                         68

elements of the matching as new adjacencies in Xˆ (by Claim 3). Note that the N − | X |
elements of U(A, X ) are a perfect matching on M(X ). Thus, Xˆ = X ∪ U(A, X ) is a
completion of X . Furthermore, by Claim 4, Xˆ contains c(AGA,X )+ | U(A, X ) | cycles,
where c(AGA,X ) denotes the number of cycles in the partial adjacency graph for A and X .
This is maximum because of the following bound.
Claim 5. Let A be a circular-unichromosomal genome and A be a partial genome. For
any completion Xˆ , c(AG ˆ ) − c(AGA,X ) ≤ |M(X )| =| U(A, X ) |.
                         A,X                     2

The claim follows from the fact that adding a new adjacency to X requires the selection of
two extremities from M(X ) and each new adjacency can increase the number of cycles in
the adjacency graph by at most one.

Thus, there is a straightforward algorithm for solving the unrestricted version of the com-
pletion problem: construct the partial adjacency graph for the reference genome A and the
partial genome X , identify the unsatisfied pairs U(A, X ), and add the unsatisfied pairs to
X to complete the genome Xˆ = X ∪ U(A, X ). The running time is linear in the number
of genes, N . Every unsatisfied pair will yield a new cycle in the resulting adjacency graph
AGA,Xˆ , and this is optimal by Claim 5.
Theorem 8 (Unrestricted Completion Problem). Given a circular-unichromosomal genome
A and partial genome X on N genes, an optimal solution to the unrestricted completion
problem Xˆ of X will exhibit the DCJ distance:

                   dDCJ (A, Xˆ ) = N − [c(AGA,X )+ | U(A, X |],
                                 = N − [c(AGA,X ) + (N − | X |)],                      (4.5)
                                 =| X | −c(AGA,X ).

where c(AGA,X ) is the number of cycles in the partial adjacency graph AG(A, X ).

We note that a solution to the unrestricted version of the problem can exhibit as many as
Ω(N ) chromosomes. Allowing the construction of a genome to include arbitrarily many
chromosomes may not be desirable from a biological perspective. Instead, we may choose
to enforce that the constructed genome contain, for example, a single chromosome. In the
next section, we consider this more restricted version of the completion problem.

4.5.2   The Restricted Problem

Here we give an algorithm for solving the restricted completion problem. The input to this
problem is the same as for the unrestricted version. We can assume again that the input
                                                                                          69

reference genome A only contains adjacencies because it is circular-unichromosomal and
that the input partial genome X only contains adjacencies by definition. Therefore, as
before, the partial adjacency graph AGA,X is comprised of only cycles of even length and
even-length paths that start and end at nodes in A.

In the restricted completion problem, we must complete X such that the resulting genome
Xˆ is circular-unichromosomal. The definition of a circular-unichromosomal genome is that
it contains no telomeres and only one chromosome. The genome graph of a genome that
contains only one chromosome contains only a single connected component. Therefore,
we have the following.
Claim 6. The genome graph of a circular-unichromosomal genome on N genes is com-
prised of a single simple cycle that visits all the nodes (corresponding to all of the 2N
extremities).

Consider the partial genome graph GX . Any completion Xˆ of X will contain GX as a
subgraph because all the original adjacencies in X will remain in the final completion Xˆ .
If X is a valid input (i.e. it can be completed in such a way that the resulting genome is
circular-unichromosomal), we can assume that GX does not contain any cycles. Instead,
GX is comprised of a collection of simple paths.
Definition 31. Given a partial genome X with missing extremities M(X ), let u, v be ex-
tremities in M(X ). {u, v} is an excluded adjacency if u and v are both degree-one nodes
at opposite ends of the same simple path in the partial genome graph GX .
Definition 32. Given a partial genome X , E(X ) is the set of all excluded adjacencies.

See Fig. 4.2 for an example.
Lemma 8. For a partial genome X comprised of only adjacencies with genome graph
GX = (V, E) and a perfect matching M on M(X ), the completion Xˆ = X ∪ M is
circular-unichromosomal if and only if GXˆ = (V, E ∪ M ) is a simple cycle.

Proof: The forward direction of the lemma is a consequence of Claim 6. For the backward
direction, suppose GXˆ is a simple cycle. Then every extremity in GXˆ has degree two and
belongs to the same the connected component. Therefore, GXˆ has exactly one chromosome
and no telomeres and is, thus, circular-unichromosomal.                                   
Lemma 9. Given a partial genome X that contains only adjacencies, let M be a perfect
matching on M(X ). For a partition of E(X ) into two sets E 1 (X ) and E 2 (X ), let M1 (X )
denote the subset of M(X ) equal to M1 (X ) = {u, v ∈ M(X ) | {u, v} ∈ E 1 (X )},
and similar for M2 (X ). There exists a partition of E(X ) into two nonempty sets E 1 (X )
                                                                                           70

and E 2 (X ) such that M can be partitioned into a perfect matching M 1 on M1 (X ) and
a perfect matching M 2 on M2 (X ), if and only if the completion Xˆ = X ∪ M is not
circular-unichromosomal.

Proof: First we prove the forward direction. Suppose M can be partitioned into M 1 , a
perfect matching on M1 (X ), and M 2 , a perfect matching on M2 (X ). By the definition of
an excluded adjacency, for every {u, v} ∈ E(X ), the partial genome graph GX = (V, E)
contains a u-to-v path. Therefore, the partial genome graph induced on (V, E ∪ M 1 ) con-
tains a cycle that visits every extremity in M1 (X ). And similarly, the partial genome graph
induced on (V, E ∪ M 2 ) contains a cycle that visits every extremity in M2 (X ). Thus, the
genome graph GXˆ = (V, E ∪ M 1 ∪ M 2 ) contains two edge-disjoint cycles. Therefore, by
Lemma 8, Xˆ is not circular-unichromosomal.

For the backward direction, suppose the completion Xˆ = X ∪M is not circular-unichromosomal.
We shall show the existence of a partition of E(X ) into two sets for which M can be par-
titioned into two perfect matchings. By Lemma 8, the genome graph GXˆ is not a simple
cycle. But because X contains only adjacencies and M is a perfect matching, every node
in GXˆ has degree two and, thus, it must be comprised of a collection of simple cycles.
Consider one such cycle C1 . Let M1 (X ) be the subset of M(X ) visited by C1 . Note
that for every {u, v} ∈ E(X ), if u is in M1 (X ), then v is in M1 (X ) and C1 contains a
u-to-v subpath that does not contain any other element of M1 (X ). Let E 1 (X ) be the subset
E 1 (X ) = {{u, v} ∈ E(X ) | {u, v} ∈ M1 (X )}. Also let M 1 be the matching on M1 (X )
defined by the set of edges in C1 that do not appear in GX . Let M2 (X ) be the subset of
M(X ) visited by every cycle in GXˆ other than C1 , i.e. M2 (X ) = M(X ) \ M1 (X ). Let
M 2 be the matching on M2 (X ) defined by the set of edges in those other cycles that do not
appear in GX . M 1 and M 2 partition M , and E 1 (X ) and E 2 (X ) partition E(X ). Moreover,
M 1 is a perfect matching on M1 (X ) and M 2 is a perfect matching on M2 (X ).             

We state the contrapositive of the backward direction of Lemma 9 as a corollary.
Corollary 9. Given a partial genome X that contains only adjacencies, let M be a perfect
matching on M(X ). If there does not exist a partition of E(X ) into two sets E 1 (X ) and
E 2 (X ), such that M can be partitioned into perfect matchings on E 1 (X ) and E 2 (X ), then
the completion Xˆ = X ∪ M is circular-unichromosomal.

This corollary characterizes the types of adjacencies we can use to augment a partial
genome X in order to construct a circular-unichromosomal completion. In particular,
we must find a matching M on M(X ) and completion Xˆ = X ∪ M such that for each
                                                                                            71

{u, v} ∈ E(X ), GXˆ contains a simple u-to-v path that contains every edge in M .

Consider the augmentation of X by a single new adjacency {u, v} on elements of M(X ),
yielding the new partial genome X 0 . If the pair {u, v} is not an excluded adjacency, then the
addition of the new adjacency merges two contigs into one longer contig. Suppose the pairs
{u, u0 } and {v, v 0 } were excluded adjacencies in E(X ). The partial genome X 0 will contain
one fewer contig and the set of excluded adjacencies will now contain the extremities found
at opposite ends of that merged contig, namely {u0 , v 0 }. u and v will no longer belong to
any excluded adjacencies because they are not elements of M(X 0 ).

Recall that the objective of the restricted block ordering problem is to find a completion Xˆ
of X that maximizes the number of cycles in the resulting adjacency graph for the reference
genome A and Xˆ . Recall too that augmenting X with a perfect matching on M(X ) induces
a number of new cycles in the adjacency graph without changing the number of cycles that
exist in the partial adjacency graph AGA,X . Again, the number of such new cycles is
bounded by | U(A, X ) |=| M(X ) | /2. In the unrestricted version of the problem, we
were able to achieve this upper bound by adding the adjacencies defined by U(A, X ) to
X . However, Lemma 9 limits our ability to select arbitrarily a set of adjacencies from
M(X ) to add to X . In particular, if the genome graph GX ∪U (A,X ) contains more than one
cycle, the completion X ∪ U(A, X ) is not circular-unichromosomal. For every cycle in the
genome graph for such a multi-chromosomal completion, at least one edge cannot appear
in a circular-unichromosomal completion.
Lemma 10. Given a circular-unichromosomal genome A and a set of adjacencies X , let
Xˆ be a circular-unichromosomal completion of X . The number of cycles in the adjacency
graph for A and Xˆ is bounded by:
                                              | M(X ) |
                c(AGA,Xˆ ) ≤ c(AGA,X ) +                  − c(GX ∪ U (A,X ) ) + 1.       (4.6)
                                                   2
Proof: First, suppose c(GX ∪ U (A,X ) ) = 1. In this case, the lemma follows from the fact that
the maximum number of cycles that can be added to AGA,X by completing X is bounded
     |M(X )|
by     2
        =| U(A, X ) |. Now suppose c(GX ∪ U(A,X ) ) > 1. Then the completion X 0 =
X ∪ U(A, X ) is not circular-unichromosomal. So, Xˆ cannot contain all the unsatisfied
pairs U(A, X ). In particular, for each cycle induced in c(GX ∪ U(A,X ) ), at least one edge
from that cycle cannot be included in Xˆ . Therefore, at least c(GX ∪ U(A,X ) ) elements of
U(A, X ) cannot appear in Xˆ . The lemma follows.                                           

This upper bound is tight and we give an algorithm that achieves this upper bound. The
algorithm is described in Algorithm 1.
                                                                                        72


 Algorithm 1: Restricted Block Ordering Problem with DCJ
   Data: A, circular-unichromosomal genome, X partial genome.
   Result: Xˆ , circular-unichromosomal genome with X ⊆ Xˆ such that dDCJ (A, Xˆ ) is
           minimum.
 1 begin
 2    GX ←− partial genome graph;
 3    M(X ) ←− missing extremities;
 4    AGA,X ←− partial adjacency graph;
 5    U(A, X ) ←− unsatisfied pairs;
 6    Xˆ ←− X ;
 7    E(Xˆ ) ←− excluded adjacencies;
 8    % main for loop;
 9    for {u, v} ∈ U(A, X ) do
10        if {u, v} ∈  / E(Xˆ ) then
11             let u be such that {u, u0 } ∈ E(Xˆ );
                     0

12             let v 0 be such that {v, v 0 } ∈ E(Xˆ );
13             E(Xˆ ) ←− E(Xˆ ) ∪ {u0 , v 0 } \ {u, u0 } \ {v, v 0 };
14             Xˆ ←− Xˆ ∪ {u, v};
15             M(X ) ←− M(X ) \ {u, v};
16      while M(X ) 6= ∅ do
17         for i, j ∈ M(X ) do
18             if {i, j} ∈
                         / E(Xˆ ) then
19                  Xˆ ←− Xˆ ∪ {i, j};
20                  M(X ) ←− M(X ) \ {i, j};

21      Output Xˆ ;
22   end
                                                                                          73

The running time is linear in the number of genes, N .

The algorithm achieves the upper bound given in Lemma 10 by greedily adding unsatisfied
pairs to the partial genome X as long as they do not induce a cycle in the partial genome
graph GXˆ . This is verified in the main for loop by checking whether an unsatisfied pair is
also a member of the set of excluded adjacencies, i.e. the adjacencies whose addition to
the partial genome Xˆ would induce a cycle in the partial genome graph. The final set of
adjacencies added to Xˆ in the second loop connects all the remaining extremities in such
a way that no excluded adjacencies are added to Xˆ and one final cycle is added to the
adjacency graph.
Theorem 10 (Restricted Completion). Given a circular-unichromosomal genome A and
partial genome X on N genes, an optimal solution to the restricted block ordering problem
Xˆ of X will exhibit the DCJ distance:

                                       | M(X ) |
                                                                      
        dDCJ (A, X ) = N − c(AGA,X ) +
                 ˆ                               − c(GX ∪ U(A,X ) ) + 1 ,               (4.7)
                                          2

where c(AGA,X ) is the number of cycles in the partial adjacency graph AGA,X and
c(GX ∪ U(A,X ) ) is the number of cycles in the genome graph for the complete genome X ∪
U(A, X ).

4.6   Future Directions

Traditionally, ESP reads are assumed to represent clones of a single cancer genome se-
quence. But current ESP sequencing technology uses approximately 1 µg physical DNA
sample in order to generate and sequence clones. Even if the DNA extracted for use in
an ESP experiment represents molecules from a single tissue sample from a single patient,
there is no guarantee that all of the DNA comes from a single genome. In particular, humans
contain diploid genomes, so clones could be made from either copy of a patient’s genome.
Moreover, because cancer is characterized by a progressive series of somatic mutations, a
single tissue might contain many differently mutated versions of the cancer genome. As a
result, it is possible that a set of ESP reads represent clones taken from multiple different
cancer genomes even if they come from the same tissue sample. Therefore, it is reasonable
to take this assumption into consideration when characterizing tumor rearrangements using
ESP data.

Here we formalize the problem of inferring a set of differently mutated cancer genomes
from a set of measured adjacencies. As in our description of the completion problem, we
                                                                                                74

assume that our measured data is incomplete, representing some set of breakpoints (i.e.
adjacencies) that are known to exist in the cancer sample but we do not assume that the set
of measured adjacencies from our unknown tumor sample is comprehensive.
Definition 33. A k-completion, Xˆ k , of X is a set of k different genomes such that X ⊆
S
  Xˆ k .

Again, let A denote a reference healthy genome. We represent a set Xˆ k of k cancer
genomes that represent mutations of a healthy genome as a rooted tree T ˆk on Xˆ k ∪ {A},X
rooted at A. We call TXˆk a mixture tree. Given a distance metric, such as DCJ, the total
distance on a mixture tree TXˆk = (V, E) is given by:

                                                     X
                                 dDCJ (TXˆk ) =              dDCJ (u, v).                     (4.8)
                                                   (u,v)∈E


We now suggest the following parsimony-based problem.
Definition 34. Given a set of measured adjacencies, X , and an integer k > 0, the k-
Mixture Problem is to find a k-completion such that dDCJ (TXˆk ) is minimum.

Again, we can distinguish between restricted and unrestricted versions of the problem.
Note that when k = 1, the (un)restricted k-mixture problem is equivalent to the (un)restricted
completion problem.

As a starting point, we consider here the k-mixture problem when k = 2. There are exactly
two different (unlabeled) rooted tree topologies on 3 nodes, namely the tree comprised of
a root and two daughter nodes, that we shall refer to as the branch topology, and the tree
comprised of a root, one internal node, and one leaf, that we shall refer to as the path
topology.

First, we provide a motivating example to show that both topologies must be considered
when solving an instance of the k-mixture problem for k = 2.

Consider the example on N = 4 genes with reference genome and with measured adjacen-
cies as follows:

             A = {{ah , bt }, {bh , ct }, {ch , dt }, {dh , at }} ,
                                                                                              (4.9)
             X = {{ah , bh }, {bt , ct }, {ch , dh }, {ah , ct }, {bh , ch }, {bt , dt }} .

Suppose we are interested in the restricted version of the problem wherein the k-completion
is required to be comprised of circular-unichromosomal genomes. In this example, there is
                                                                                                          75

only one way to partition the set of adjacencies into two sets representing partial genomes
(i.e. without repeated extremities), namely, the partition defined by the first three elements
listed in X above and the second three elements listed in X . That partition defines two par-
tial genomes for which there exist unique completions. The resulting completions are, re-
spectively, B = {{ah , bh }, {bt , ct }, {ch , dh }, {dt , at }} and C = {{ah , ct }, {bh , ch }, {bt , dt },
{dh , at }}. The mixture tree on B and C that corresponds to the branch topology admits a
total tree-distance of dDCJ T = dDCJ (A, B)+dDCJ (A, C) = 2+2 = 4. There are two pos-
sible labelings of the nodes in a mixture tree corresponding to the path topology. The first
labeling admits a total tree-distance of dDCJ T = dDCJ (A, B) + dDCJ (B, C) = 2 + 3 = 5.
The second labeling admits a total tree-distance of dDCJ T = dDCJ (A, C) + dDCJ (C, B) =
2 + 3 = 5 as well. Therefore, in this example, a tree with the branch topology admits a
more parsimonious mixture than a tree with the path topology.

Conversely, consider the example with the same reference genome A and with measured
adjacencies as follows:

               A = {{ah , bt }, {bh , ct }, {ch , dt }, {dh , at }} ,
                                                                                                      (4.10)
               X = {{ah , bh }, {ct , dt }, {ch , at }, {ah , ch }, {bh , at }, {bt , dt }} .
In this example, there are several different partitions of X that represent two partial genomes,
and those partial genomes admit several different completions. However, an optimal mix-
ture with a branch topology admits a total score of 5 whereas an optimal mixture with a
path topology admits a total score of 4. Therefore, in this example, a tree with the path
topology admits a more parsimonious mixture than a tree with the branch topology.

We hope to characterize the instances in which the two tree topologies for k = 2 admit
solutions with different scores. Given a set of measured adjacencies, and a tree topology,
then, we hope to devise an efficient algorithm to construct the optimal k-completion of
X . We expect that the algorithms given in the previous section will be the basis for any
algorithms we devise to solve the k-mixture problem.

As a first step toward characterizing the inputs to the 2-mixture problem for which one of
the two possible tree topologies admits a more parsimonious solution than the other, we
note that for a reference genome A, a set of adjacencies X , and a 2-completion comprised
of genomes B and C, of X , we can directly determine the best topology if we know the
pairwise distances between A, B, and C, respectively. In particular, the total distance on a
tree with a branch topology is equal to dDCJ (A, B) + dDCJ (B, C). By contrast, the distance
on a tree with path topology is equal to either dDCJ (A, B) + dDCJ (B, C) or dDCJ (A, C) +
                                                                                             76

dDCJ (C, B), depending on the placement of genomes at the nodes in the tree. This gives us
the following lemma.
Lemma 11. Given a circular-unichromosomal genome A on N genes and a set of adja-
cencies X in which each extremity appears no more than twice, let genomes B and C be a
2-completion of X . If dDCJ (B, C) < dDCJ (A, C) and dDCJ (B, C) < dDCJ (A, B), then a
path topology is more parsimonious than a branch topology. If dDCJ (A, B) < dDCJ (B, C)
and dDCJ (A, C) < dDCJ (B, C), then a branch topology is more parsimonious than a path
topology.

Therefore, given a pair of partial genomes, we can construct a 2-completion in linear
time using one of the approaches described in Sections 4.5.1 and 4.5.2. And given a 2-
completion of a set of adjacencies, we can decide which tree topology admits a more par-
simonious solution in linear time. However, it remains an open question as to whether the
there exists an efficient method for finding a partition of a set of adjacencies X into two
partial genomes whose completions will admit an optimal mixture tree.

In some cases, we find that a single genome completion of a set of adjacencies will be
as parsimonious as any k-completion for k > 1, indicating that the measured adjacencies
were most likely taken from a single tumor genome instead of from a mixture. We note
that if a set of adjacencies contains some extremity more than once then it is not possible
to construct a single genome that contains all the adjacencies represented. But in the case
that a set of adjacencies X contains each extremity at most once, we can show that, in the
unrestricted case, any 2-completion of X on a tree with branch topology is no better than a
(single genome) completion.
Lemma 12. Given a circular-unichromosomal genome A on N genes and a set of adja-
cencies X in which each extremity in A appears no more than once, let G = {X1 , X2 } be a
2-completion of X with branch topology and let Xˆ be an optimal, unrestricted completion
of X . Then, dDCJ (TG ) ≥ dDCJ (A, Xˆ ).

Proof: First, we note that X1 and X2 partition the adjacencies in X , so | X |=| X1 | + |
X2 |. Moreover, for a partition {X1 , X2 } of X , the total number of cycles in their respective
partial adjacency graphs cannot exceed the number of cycles in the partial adjacency graph
for X , i.e. c(AGA,X ) ≥ c(AGA,X1 ) + c(AGA,X2 ). This is because the set of cycles in
AGA,X1 and the set of cycles in AGA,X2 are both disjoint subsets of the set of cycles in
AGA,X .

Now, in order to show the lemma, we must show that the total distance dDCJ (A, X1 ) +
                                                                                            77

dDCJ (A, X2 ) is at least as much as the distance dDCJ (A, Xˆ ). By Thm 8, we have that
dDCJ (A, Xˆ ) =| X | −c(AGA,X ). Thus, we can show the following bound.

                  dDCJ (A, Xˆ ) ≤ dDCJ (A, X1 ) + dDCJ (A, X2 ),
             | X | −c(AGA,X ) ≤ | X1 | −c(AGA,X1 )+ | X2 | −c(AGA,X2 ),
             | X | −c(AGA,X ) ≤ | X | − (c(AGA,X1 ) + c(AGA,X2 )) ,
                     c(AGA,X ) ≥ c(AGA,X1 ) + c(AGA,X2 ).

                                                                                            
We can extend this proof to any set of adjacencies in which each extremity appears no
more than d adjacencies; given such a set of adjacencies, an unrestricted d-completion
with “branch topology” (i.e. a star graph with d daughter nodes) will admit at least as
parsimonious a solution with respect to total DCJ distance on edges as any d+1 completion
with branch topology.

We believe that the techniques we introduced in this chapter for addressing the block or-
dering problem with respect to DCJ distance using the adjacency graph data structure is
an intuitive and simple framework. We conjecture that it may be possible to extend this
framework to address the k-mixture problem with respect to DCJ distance.

The k-mixture problem, however, might also prove interesting if we consider different
measures of parsimony, such as reversal distance. As we pointed out at the beginning of
this chapter, DCJ distance is a pretty good approximation for reversal distance between a
pair of signed genomes. However, there are examples for which DCJ distance is strictly
less than reversal distance. Due to this discrepancy, we cannot merely extrapolate the result
from Lemma 12 to a similar problem where the measure of parsimony is reversal distance.
Consider the following example.

Suppose that the reference genome A is now linear-unichromosomal and is the identity per-
mutation on 6 genes. Let X = {{1h , 4h }, {2t , 5t }, {3h , 6h }, {4t }}. (Note that X contains
a telomere.) The genome Xˆ = −5+2+3−6+1−4 is the completion (represented now as a
signed string instead of as a set of adjacencies and telomeres) that minimizes the reversal
distance to A: drev (A, Xˆ ) = 3. For k = 2, a most-parsimonious k-completion on a tree
with branch topology is G = {B = (+1 −4 −3 −2 +5 +6), C = (+1 +2 +3 −6 −5 −4)}.
The optimal mixture tree with branch topology for the linear-unichromosomal case with
respect to reversal distance is given in Fig. 4.7. The total number of reversals on the tree is
                                                                                           78


                                           (1, 2, 3, 4, 5, 6)


                                d=1                             d=1


                        (1, -4, -3, -2, 5, 6)               (1, 2, 3, -6, -5, -4)


Figure 4.7: A mixture tree T on genomes B = (+1, −4, −3, −2, +5, +6) and C =
(+1, +2, +3, −6, −5, −4). The distances from the root to each of its children are: d(A, B) =
d(A, C) = 1, and thus drev (T ) = 2.


only 2, which is less than the total number of reversals between A and the optimal (single
genome) completion Xˆ .

The k-completion problem seems related to the well-studied problem of constructing a phy-
logenetic tree to represent a rearrangement history for a set of known genomes of common
ancestry. For example, [14] and [22] consider the problem of computing a phylogenetic
tree for a set of known genomes by minimizing the total breakpoint distance on the tree,
and [16] consider the a similar problem but minimize the total reversal distance on the
tree. However, in the phylogenetic tree problem, the leaves of the tree are a set of known
genomes and the goal is to compute a set of unknown ancestral genomes that represent
internal nodes in the tree. We, instead, are interested in constructing the genomes at all the
nodes in the tree from impartial data.

For a set of adjacencies and telomeres X and an integer k, finding a most-parsimonious
mixture of k of cancer genomes amounts to partitioning X into k sets such that we may
construct k genomes, each containing some subset of the elements in X . There are expo-
nentially many ways to do this; in particular, there are ki=1 S(| X |, i) different ways,
                                                        P

where S(n, k) is the Stirling number of the second kind. Then given an integer k, there are
(k + 1)k−1 different possible labeled trees on k + 1 nodes and thus as many mixture trees
on k permutations. Thus, an exhaustive search procedure could not find an optimal mixture
tree efficiently.
                        BIBLIOGRAPHY

 [1] Max A. Alekseyev and Pavel A. Pevzner. Whole genome duplications and contracted
     breakpoint graphs. SICOMP, 36(6):1748–1763, 2007.

 [2] C. Alkan, J. M. Kidd, T. Marques-Bonet, G. Aksay, F. Antonacci, F. Hormozdiari,
     J. O. Kitzman, C. Baker, M. Malig, O. Mutlu, S. C. Sahinalp, R. A. Gibbs, and
     E. E. Eichler. Personalized copy number and segmental duplication maps using next-
     generation sequencing. Nat. Genet., 41:1061–1067, 2009.

 [3] Can Alkan, Evan E. Eichler, Jeffrey A. Bailey, S. Cenk Sahinalp, and Eray Tuzun. The
     role of unequal crossover in alpha-satellite DNA evolution: a computational analysis.
     J Comp Biol, 11(5):933–944, 2004.

 [4] David A. Bader, Bernard M.E. Moret, and Mi Yan. A linear-time algorithm for com-
     puting inversion distance between signed permutations with an experimental study.
     Journal of Computational Biology, 8(5):483–491, 2001.

 [5] V. Bafna and P.A. Pevzner. Genome rearrangements and sorting by reversals. Foun-
     dations of Computer Science, Annual IEEE Symposium on, 0:148–157, 1993.

 [6] J.A. Bailey and E.E. Eichler. Primate segmental duplications: crucibles of evolution,
     diversity and disease. Nat. Rev. Genet., 7:552–564, 2006.

 [7] J.A. Bailey, Z. Gu, R.A. Clark, K. Reinert, R.V. Samonte, S. Schwartz, M.D. Adams,
     E.W. Myers, P.W. Li, and E.E. Eichler. Recent segmental duplications in the human
     genome. Science, 297:1003–1007, 2002.

 [8] Ali Bashir, Chun Ye, Alkes L. Price, and Vineet Bafna. Orthologous repeats and
     mammalian phylogenetic inference. Genome Research, 15(7):998–1006, 2005.

 [9] Anne Bergeron, Julia Mixtacki, and Jens Stoye. A unifying view of genome rear-
     rangements. Algorithms in Bioinformatics, pages 163–173, 2006.

[10] Piotr Berman and Sridhar Hannenhalli. Fast sorting by reversal. In Proc. 7th Ann.
     Symposium on Combinatorial Pattern Matching, pages 168–185, 1996.


                                           79
                                                                                      80

[11] Denis Bertrand, Mathieu Lajoie, and Nadia El-Mabrouk. Inferring ancestral gene
     orders for a family of tandemly arrayed genes. J Comp Biol, 15(8):1063–1077, 2008.

[12] M. Blanchette, G. Bourque, and D. Sankoff. Breakpoint phylogenies. In Genome
     Informatics Workshop (GIW), volume 8, pages 25–34, 1997.

[13] M. Blanchette, E.D. Green, W. Miller, and D. Haussler. Reconstructing large regions
     of an ancestral mammalian genome in silico. Genome Res., 14:2412–2423, 2004.

[14] Mathieu Blanchette, Takashi Kunisawa, and David Sankoff.           Gene order break-
     point evidence in animal mitochondrial phylogeny. Journal of Molecular Evolution,
     49(2):193–203, 1999.

[15] R. Blekhman, A. Oshlack, and Y. Gilad. Segmental duplications contribute to gene
     expression differences between humans and chimpanzees. Genetics, 182:627–630,
     2009.

[16] Guillaume Bourque and Pavel A. Pevzner. Genome-scale evolution: reconstructing
     gene orders in the ancestral species. Gen. Res., 12:26–36, 2002.

[17] Alberto Caprara. Formulations and hardness of multiple sorting by reversals. In
     RECOMB ’99: Proceedings of the third annual international conference on Compu-
     tational molecular biology, pages 84–93, New York, NY, USA, 1999. ACM.

[18] Kamalika Chaudhuri, Kevin Chen, Radu Mihaescu, and Satish Rao. On the tan-
     dem duplication-random loss model of genome rearrangement. In Proceedings of the
     Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages
     564–570, New York, NY, USA, 2006. ACM.

[19] Xin Chen, Jie Zheng, Zheng Fu, Peng Nan, Yang Zhong, Stefano Lonardi, and Tao
     Jiang. Assignment of orthologous genes via genome rearrangement. IEEE/ACM
     Trans. Comp. Biol. Bioinformatics, 2(4):302–315, 2005.

[20] Zhi-Zhong Chen, Lusheng Wang, and Zhanyong Wang. Approximation algorithms
     for reconstructing theduplication history oftandemrepeats. Algorithmica, pages 1748–
     1763, 2008.

[21] F.D. Ciccarelli, C. von Mering, M. Suyama, E.D. Harrington, E. Izaurralde, and
     P. Bork. Complex genomic rearrangements lead to novel primate gene function.
     Genome Res., 15:343–351, 2005.
                                                                                     81

[22] M.E. Cosner, R.K. Jansen, B.M.E. Moret, L.A. Raubeson, L.S. Wang, T. Warnow, and
     S. Wyman. An empirical comparison of phylogenetic methods on chloroplast gene
     order data in campanulaceae. Comparative Genomics (DCAF 2000), pages 104–115,
     2000.

[23] Nadia El-Mabrouk. Genome rearrangement by reversals and insertions/deletions
     of contiguous segments. In In Proc. 11th Ann. Symp. Combin. Pattern Matching
     (CPM00), volume 1848, pages 222–234, Berlin, 2000. Springer-Verlag.

[24] Nadia El-Mabrouk. Reconstructing an ancestral genome using minimum segments
     duplications and reversals. J. Comput. Syst. Sci., 65(3):442–464, 2002.

[25] Nadia El-Mabrouk and David Sankoff. The reconstruction of doubled genomes.
     SICOMP, 32(3):754–792, 2003.

[26] Olivier Elemento, Olivier Gascuel, and Marie-Paule Lefranc. Reconstructing the du-
     plication history of tandemly repeated genes. Mol Biol Evol, 19(3):278–288, 2002.

[27] Funda Ergun, S. Muthukrishnan, and S. Cenk Sahinalp. Comparing sequences with
     segment rearrangements. In In Proceedings of Foundations of Software Technology
     and Theoretical Computer Science, pages 183–194. Springer, 2003.
     ´ Gaul and Mathieu Blanchette. Ordering partially assembled genomes using gene
[28] Eric
     arrangements. In Guillaume Bourque and Nadia El-Mabrouk, editors, Compara-
     tive Genomics, volume 4205 of Lecture Notes in Computer Science, pages 113–128.
     Springer, 2006.

[29] Paolo Giudici and Robert Castelo. Improving markov chain monte carlo model search
     for data mining. Machine Learning, 50(1-2):127–158, 2003.

[30] Sridhar Hannenhalli and Pavel Pevzner. Transforming cabbage into turnip: polyno-
     mial algorithm for sorting signed permutations by reversals. In STOC ’95: Proceed-
     ings of the twenty-seventh annual ACM symposium on Theory of computing, pages
     178–189, New York, NY, USA, 1995. ACM.

[31] J.E. Horvath, C.L. Gulden, R.U. Vallente, M.Y. Eichler, M. Ventura, J.D. McPherson,
     T.A. Graves, R.K. Wilson, S. Schwartz, M. Rocchi, and E.E. Eichler. Punctuated
     duplication seeding events during the evolution of human chromosome 2p11. Genome
     Res., 15:914–927, 2005.
                                                                                      82

[32] Zhaoshi Jiang, Haixu Tang, Mario Ventura, Maria Francesca Cardone, Tomas
     Marques-Bonet, Xinwei She, Pavel A Pevzner, and Evan E Eichler. Ancestral re-
     construction of segmental duplications reveals punctuated cores of human genome
     evolution. Nature Genetics, 39:1361–1368, 2007.

[33] Crystal Kahn, Shay Mozes, and Benjamin Raphael. Efficient algorithms for analyzing
     segmental duplications with deletions and inversions in genomes. Algorithms for
     Molecular Biology, 5(1):11, 2010.

[34] Crystal L. Kahn, Borislav H. Hristov, and Benjamin J. Raphael. Parsimony and likeli-
     hood reconstruction of human segmental duplications. Bioinformatics, 26(18):i446–
     i452, 2010.

[35] Crystal L. Kahn, Shay Mozes, and Ben J. Raphael. Efficient algorithms for analyzing
     segmental duplications, deletions, and inversions in genomes. In WABI, volume 5724
     of Lecture Notes in Computer Science, pages 169–180. Springer, 2009.

[36] Crystal L. Kahn and Ben J. Raphael. Analysis of segmental duplications via duplica-
     tion distance. Bioinformatics, 24:i133–138, 2008.

[37] Crystal L. Kahn and Ben J. Raphael. A parsimony approach to analysis of human
     segmental duplications. In Pacific Symposium on Biocomputing, volume 14, pages
     126–137, 2009.

[38] P. M. Kim, H. Y. Lam, A. E. Urban, J. O. Korbel, J. Affourtit, F. Grubert, X. Chen,
     S. Weissman, M. Snyder, and M. B. Gerstein. Analysis of copy number variants and
     segmental duplications in the human genome: Evidence for a change in the process
     of formation in recent evolutionary history. Genome Res., 18:1865–1874, 2008.

[39] Mathieu Lajoie, Denis Bertrand, Nadia El-Mabrouk, and Oliver Gascuel. Duplication
     and inversion history of a tandemly repeated gene family. J. Comp. Bio., 14(4):462–
     478, 2007.

[40] E.V. Linardopoulou, S.S. Parghi, C. Friedman, G.E. Osborn, S.M. Parkhurst, and
     B.J. Trask. Human subtelomeric WASH genes encode a new subclass of the WASP
     family. PLoS Genet., 3:e237, 2007.

[41] T. Liu, B.M.E. Moret, and D.A. Bader. An exact linear-time algorithm for computing
     genomic distances under inversions and deletions, 2003.
                                                                                        83

[42] J.R. Lupski and P. Stankiewicz. Genomic disorders: molecular mechanisms for rear-
     rangements and conveyed phenotypes. PLoS Genet., 1:e49, 2005.

[43] Mark Marron, Krister M. Swenson, and Bernard M. E. Moret. Genomic distances
     under deletions and insertions. TCS, 325(3):347–360, 2004.

[44] J S McCaskill. The equilibrium partition function and base pair binding probabilities
     for rna secondary structure. Biopolymers, 29(6-7):1105–19.

[45] Aleksandar Milosavljevic, David Haussler, and Jerzy Jurka. Informed parsimonious
     inference of prototypical genetic sequences. In COLT ’89: Proceedings of the second
     annual workshop on Computational learning theory, pages 102–117, San Francisco,
     CA, USA, 1989. Morgan Kaufmann Publishers Inc.

[46] Bernard Moret, Bernard M. E, Stacia Wyman, David A. Bader, Tandy Warnow, and
     Mi Yan. A new implementation and detailed study of breakpoint analysis, 2001.

[47] Bernard M.E. Moret, Li-San Wang, Tandy Warnow, and Stacia K. Wyman. New
     approaches for reconstructing phylogenies from gene order data. Bioinformatics,
     17(suppl 1):S165–173, 2001.

[48] Sean O’Rourke, Noah Zaitlen, Nebojsa Jojic, and Eleazar Eskin. Reconstructing
     the phylogeny of mobile elements. In In Proceedings of the Eleventh Annual Con-
     ference on Research in Computational Biology (RECOMB 2007), pages 196–210,
     Berlin, 2007. Springer.

[49] I. Pe’er and R. Shamir. The median problems for breakpoints are NP-complete. In
     Elec. Colloq. on Comput. Complexity, volume 71, 1998.

[50] Pavel Pevzner. Computational molecular biology : an algorithmic approach. MIT
     Press, Cambridge, Mass., 2000.

[51] E. D. Pleasance, R. K. Cheetham, P. J. Stephens, D. J. McBride, S. J. Humphray,
     C. D. Greenman, I. Varela, M.-L. Lin, G. R. Ord´on˜ ez, G. R. Bignell, K. Ye, J. Ali-
     paz, M. J. Bauer, D. Beare, A. Butler, R. J. Carter, L. Chen, A. J. Cox, S. Edkins,
     P. I. Kokko-Gonzales, N. A. Gormley, R. J. Grocock, C. D. Haudenschild, M. M.
     Hims, T. James, M. Jia, Z. Kingsbury, C. Leroy, J. Marshall, A. Menzies, L. J. Mudie,
     Z. Ning, T. Royce, O. B. Schulz-Trieglaff, A. Spiridou, L. A. Stebbings, L. Sza-
     jkowski, J. Teague, D. Williamson, L. Chin, M. T. Ross, P. J. Campbell, D. R. Bentley,
                                                                                        84

     P. A. Futreal, and M. R. Stratton. A comprehensive catalogue of somatic mutations
     from a human cancer genome. Nature, 463:191–196, January 2010.

[52] A. L. Price, E. Eskin, and P. A. Pevzner. Whole-genome analysis of Alu repeat ele-
     ments reveals complex evolutionary history. Genome Res., 14:2245–2252, 2004.

[53] D. Sankoff, G. Leduc, N. Antoine, B. Paquin, B.F. Lang, and R. Cedergren. Gene or-
     der comparisons for phylogenetic inference: evolution of the mitochondrial genome.
     Proc. Natl. Acad. Sci. U.S.A., 89(14):6575–6579, 1992.

[54] David Sankoff.       Genome rearrangement with gene families.         Bioinformatics,
     15(11):909–917, 1999.

[55] David Sankoff and Mathieu Blanchette. The median problem for breakpoints in com-
     parative genomics. In COCOON ’97: Proceedings of the Third Annual International
     Conference on Computing and Combinatorics, pages 251–264, London, UK, 1997.
     Springer-Verlag.

[56] Krister M. Swenson, Mark Marron, Joel V. Earnest-Deyoung, and Bernard M. E.
     Moret. Approximating the true evolutionary distance between two genomes. J. Exp.
     Algorithmics, 12:1–17, 2008.

[57] K. Vandepoele, N. Van Roy, K. Staes, F. Speleman, and F. van Roy. A novel gene fam-
     ily NBPF: intricate structure generated by gene duplications during primate evolution.
     Mol. Biol. Evol., 22:2265–2274, 2005.

[58] G. A. Watterson, W. J. Ewens, T. E. Hall, and A. Morgan. The chromosome inversion
     problem. Journal of Theoretical Biology, 99(1):1 – 7, 1982.

[59] Sophia Yancopoulos, Oliver Attie, and Richard Friedberg. Efficient sorting of ge-
     nomic permutations by translocation, inversion and block interchange. Bioinformat-
     ics, 21(16):3340–3346, 2005.

[60] Yu Zhang, Giltae Song, Tom´as Vinar, Eric D. Green, Adam C. Siepel, and Webb
     Miller. Reconstructing the evolutionary history of complex human gene clusters. In
     Proc. 12th Int’l Conf. on Research in Computational Molecular Biology (RECOMB),
     pages 29–49, 2008.

[61] M. C. Zody, Z. Jiang, H. C. Fung, F. Antonacci, L. W. Hillier, M. F. Cardone, T. A.
     Graves, J. M. Kidd, Z. Cheng, A. Abouelleil, L. Chen, J. Wallis, J. Glasscock, R. K.
     Wilson, A. D. Reily, J. Duckworth, M. Ventura, J. Hardy, W. C. Warren, and E. E.
                                                                               85

Eichler. Evolutionary toggling of the MAPT 17q21.31 inversion region. Nat. Genet.,
40:1076–1083, 2008.
     APPENDIX A: A DISCUSSION OF
                 ANCESTRAL GENOME
            RECONSTRUCTION USING
   DUPLICATIONS AND REVERSALS

The problem of computing a minimum number of duplication transposition and reversal
operations needed to transform an ancestral genome (with no duplicates) into a present-day
genome (with duplicates) has been studied by El-Mabrouk in [24]. To simplify the problem,
El-Mabrouk [24] devises a model of evolution in which she gives a method for solving
the general problem of computing minimum reversal/duplication transposition distance by
solving an exponential number of polynomially solvable subproblems.

In particular, the subproblem considered in [24] is that of computing the minimum rever-
sal/duplication transposition distance between a non-ambiguous ancestral genome H (that
contains every character at most once) and a semi-ambiguous genome G (in which no char-
acter appears more than twice). It is shown that the minimum number of reversals and
duplication transpositions needed to transform H into G can be computed by assuming the
rearrangements occur in two distinct phases. First, the non-ambiguous H undergoes a se-
ries of duplication transpositions, transforming it into an intermediate ancestor I having the
same character composition as G, but not necessarily in the same order. Then, the interme-
diate ancestor I evolves through a series of reversal operations, resulting in the present-day
genome G.

The problem of finding the reversal/duplication transposition distance between a semi-
ambiguous genome G and a non-ambiguous genome H then is merely the problem of
reconstructing a semi-ambiguous intermediate ancestor I that minimizes the sum of the
duplication transposition distance between any non-ambiguous ancestor H and I and the
                                             86
                                                                                             87

reversal distance between I and G.

In [24], the problem of computing the reversal/duplication transposition distance between a
non-ambiguous ancestral genome and a present-day semi-ambiguous genome G, RD(G),
is formalized as


                             RD(G) = min [R(G, I) + D(I)]                                  (A.1)
                                           I

where R(G, I) is the reversal distance between G and I, D(I) is defined as the number of
maximal, repeated substrings of I, and the minimum is taken over all possible permutations
I of the set of characters comprising G.

The algorithm given in [24], therefore, implies that the minimum number of duplication
transpositions needed to transform a non-ambiguous H into a semi-ambiguous I is equal to
the number of maximal repeated substrings of I. However, this simplification is incorrect.
Claim 7. Let H be a non-ambiguous string and let I be a semi-ambiguous string that
evolved from H through a series of duplication transpositions. D(I), the number of maxi-
mal, repeated substrings of I, is not necessarily equal to the minimum number of duplica-
tion transposition operations needed to transform H into I.
Proof: We provide the following counterexample. Consider the strings:

                                     H = abcdef g
                                      I = abdecdbcef g

In this example, the maximal repeated substrings of I are {d, e, b, c}, so D(I) = 4. How-
ever, a sequence of three duplication transposition operations would suffice to create I:
first, duplicate bc in one operation, then duplicate d and e in two separate operations.
                                                                                              

Note that in the example above, the duplication distance, d(H, I) is two. (See Chapter 2
for a description of duplication distance.) The only cases considered in [24] are cases in
which the ancestor genome H is non-ambiguous and the intermediate ancestor I is semi-
ambiguous. In these cases, the duplication distance is no greater than the duplication trans-
position distance.
Claim 8. If Y is semi-ambiguous and X is non-ambiguous, X is a subsequence of Y ,
and {Y } \ {X} = ∅ then d(X, Y ) ≤ DT (X, Y ), where DT (X, Y ) equals the number of
duplication transpositions needed to transform X into Y and {X} denotes the multiset of
characters that appear in Y (and similar for {X}).
                                                                                          88

Proof: Because Y is semi-ambiguous, no character of X is copied more than once in any
sequence of duplication transposition or duplicate operations building Y from X. There-
fore, all the strings that are copied in a minimum sequence of either duplication transposi-
tion or duplicate operations are substrings of the original input string X.

Consider a minimum-length sequence of duplication transpositions that transforms X into
Y . Because all the duplication transpositions in such a sequence copy substrings of the
original string X, there are corresponding duplicate operations, copying the same sequence
of substrings of X, giving rise to Y . However, as we noted above, the optimal sequence
of duplicate operations transforming X into Y may have strictly smaller length than that
of the optimal sequence of duplication transposition operations yielding the same Y . Thus,
d(X, Y ) ≤ DT (X, Y ).                                                                    

It follows that duplication distance may provide a simpler explanation than duplication
transposition distance in this subproblem of computing the minimum reversal/duplication
transposition distance between a non-ambiguous string and a semi-ambiguous string. More-
over, the computation implied in [24] for retrieving the ancestor genome H from the in-
termediate, semi-ambiguous genome, I, is to delete exactly one copy of every maximal
repeated substring of I. However, this may not yield an ancestor genome H such that
DT (H, I) is minimal. Consider, the example given in the proof of Claim 7. If we had
deleted the not-bolded copies of each of the repeated characters, we would construct the
ancestor genome H 0 = adebcf g. In this example, the duplication transposition distance
DT (H 0 , I) is equal to four, the number of repeats. However, H 0 is not the ancestor genome
that minimizes the duplication transposition distance. Therefore, the algorithm in [24]
not only incorrectly computes the optimal value of the duplication transposition distance
between I and some non-ambiguous H, it might also output an ancestor genome that is
non-optimal.

Now we return to the reversal/duplication transposition distance problem posed in [24].
Under the assumption that a semi-ambiguous genome G evolved from a non-ambiguous
genome H by a series of duplication transpositions followed by a series of reversals, the
formulation of Equation A.1 can be corrected if we change the definition of D(I) so that it
no longer equals the number of repeats in I. Instead, let


                                  D(I) = min DT (H, I)                                 (A.2)
                                           H∈A

where A is defined as the set of all non-ambiguous subsequences of I that are obtained by
                                                                                           89

deleting exactly one copy of every duplicated gene. Thus, if I has d duplicated genes, A
will contain 2d elements.

Unfortunately, there is no known algorithm for computing a minimum sequence of duplica-
tion transpositions needed to transform a non-ambiguous genome into one with duplicates.
The problem’s difficulty results from the fact that the set of substrings that can be dupli-
cated after the k th duplication transposition operation depends on the first k operations. On
the other hand, Claim 8 states that the duplication distance d(H, I) for any H ∈ A is a
lower bound for the duplication transposition distance.
               APPENDIX B: ANOTHER
          PROBABILISTIC MODEL OF
         SEGMENTAL DUPLICATIONS

Here we present an alternative generative probabilistic model of segmental duplications
that could be used in place of the probabilistic model presented in Section 3.2.1 to compute
a maximum likelihood evolutionary history DAG. Given a source string X and an integer
k > 0, the model here considers the weighted ensemble of all sequences of k duplicate
operations that could result in the construction of some target string. Here the order of the
duplicate operations is important; unlike in the model presented in Section 3.2.1, here we
distinguish sequences of duplicate operations in which the substrings of X are duplicated
in different orders even if the resulting target strings are identical. Then the likelihood of
a particular target string Y , given the source X and integer k, depends on the weighted
ensemble of all sequences of k duplicate operations that could be used to generate Y from
X. In principle, this model explicitly considers all possible duplication scenarios and is,
therefore, a generative model. Ultimately, we did not implement this model because the
recurrence relation–although polynomial-time–was prohibitively slow.

Again, we assume that duplication blocks, represented as signed strings on an alphabet of
duplicons, are built up from other duplication blocks through successive rounds of duplicate
operations (see Def. 4). Recall the following definition.
Definition 17. Given a source string X, a generator ΨX = (Xi1 ,j1 , . . . , Xik ,jk ) is a se-
quence of substrings of X.

Here, we redefine what it means for a generator to be feasible for a particular target string.
As before, we define a generator to be feasible for a target string if the constituent substrings
partition the target into mutually nonoverlapping subsequences. However, we now require
the order of substrings that comprise the generator to correspond to an order in which they
                                               90
                                                                                                  91

                                          X=                           abcde
                                          Y0 =                             ∅
                                          Y 1 = Y 0 ◦ δX (1, 3, 1) =     abc
                                          Y 2 = Y 1 ◦ δX (4, 5, 1) =   deabc
                                           Y = Y 2 ◦ δX (4, 5, 5) = deabdec


Figure B.1: An example of a sequence of duplicate operations that constructs Y = deabdec from
X = abcde. A feasible generator for Y is: ΨX = (X1,3 , X4,5 , X4,5 ) = ((abc), (de), (de)).


may have been duplicated from the source to build the given target. In particular, we have
the following redefinition.
Definition 35. A generator ΨX = (Xi1 ,j1 , . . . , Xik ,jk ) is feasible for a target string Y ,
that we denote as ΨX a Y , if there exists a sequence of indices (t1 , . . . , tk ) such that
Y = ∅ ◦ δX (i1 , j1 , t1 ) ◦ · · · ◦ δX (ik , jk , tk ). (See Fig. B.1).

For a given source string X and positive integer k we consider the space of all length-k
generators ΨX . Again, we define a probability distribution on the collection of generators
                                                   (k)
and we compute the partition function ZX of the weighted ensemble of all possible length-
k generators. We define the event F as before: it is the event of choosing a length-k
generator that is feasible for Y from the space of all length-k generators. Thusly, we define
a probabilistic model that, given a target string Y , assigns a probability to F :


                                                         1             X
                            P r[F | Y, X, k] =           (k)
                                                                                ω(ΨX ) ,    (B.1)
                                                      ZX       ΨX aY :|ΨX |=k

where | ΨX | denotes the length of the generator and ω(ΨX ) is the weight assigned to a
generator. We assume the weight function has the same properties as in Section 3.2.2.
                                                                                      (k)
First, we review the algorithm to compute the partition function ZX . Because we have not
changed the definition of a generator, the partition function of the ensemble of all length-
k generators can be computed as before. Every length-k generator whose elements have
lengths that sum to l are scored the same (according to σ(k, l)), we can count the total
                                                                                            (k)
number of such generators and then multiply by the score function. Again, let CX (l)
equal the number of distinct length-k generators for which the sum of the lengths of the
                                                                                            (k)
elements equals l. Recall that we gave an O(| X | k)-time algorithm for computing CX (l)
in Lemma 5:
Lemma 5. Let X = x1 . . . x|X| be a source string and let k and l be positive integers. The
                                                                                              92

           (k)
function CX (l) satisfies the following recurrence.

                         (1)
                       CX (l) = | X | −l + 1,
                                  l−1
                                  X
                        (k)              (k−1)
                       CX (l) =        CX (l0 ) · (| X | −(l − l0 ) + 1).
                                   l0 =l−|X|


                                                                       (k)              (k)
For a source string X and integers k, l, if we are given CX (l), we can compute ZX ef-
                             (k)
ficiently by summing CX (l) over all relevant lengths l, weighting each feasible generator
appropriately according to the function σ(k, l). Therefore, again we have the theorem:
Theorem 5. Let X = x1 . . . x|X| be a source string and k be a positive integer. The
                       (k)
partition function ZX satisfies the following.
                                               |X|·k
                                    (k)
                                               X        (k)
                                   ZX     =            CX (l) · σ(k, l).
                                               l=k

                                                                             (k)
The recurrence in Lemma 5 can be computed in O(| X | k) time, so ZX can be computed
in O(| X |2 k 2 ) time according to Theorem 5.

Therefore, using the new probabilistic model, we can compute the partition function of
length-k generators for a given source string just as we did in Section 3.2.1.

However, the new definition of a feasible set requires that we augment our recurrence for
computing the restricted partition function Q(k) of feasible sets. Fortunately, there is an
easy extension we can make to do this. Since all length-k generators that are feasible for a
target string Y have lengths that sum to | Y |, we can score them all according to σ(k, | Y |).
We describe here a recurrence to compute the number of distinct length-k generators ΨX
that are feasible for a given string Y .
Lemma 13. Given a source string X = x1 . . . x|X| and a target string Y = y1 , . . . , y|Y | ,
                 (k)
the number ηX (Y ) of distinct length-k generators ΨX that are feasible for Y satisfies the
                                                                                                            93

following recurrence.
                     (k)
                                      X          (k)
                    ηX (Y ) =                   ηX (Y, i),
                                     i:xi =y1
                                     k
                                     X
                   (k)                      (k)
                  ηX (Y, i)      =         ηX (Y, i, d),
                                     d=1
                                     
                                     1         if Y = Xi,i+|Y |−1 ,
              (1)
            ηX (Y, i, d) =
                                     0         otherwise,
              (k)                     (k−1)
            ηX (Y, i, d) = ηX                   (Y2,|Y | ) +
                                           X           k
                                                       X       (l)         (k−l)
                                                           ηX (Y2,j−1 )ηX          (Yj,|Y | , i + 1, d) ·
                                     j>1:yj =xi+1 l=0
                                     l−1 
                                          l−1 k−l−d+1
                                     X              
                                                                              .
                                     s=0
                                                  s                  s+1
                                     (k)
For completeness, we define ηX (Y, i, d) = 0 for values d > k.

This lemma is the analog to Lemma 6. Here, though, we cannot simply count the number
of ways we can partition Y into k mutually nonoverlapping subsequences that correspond
to substrings of X – we must consider how such a set of nonoverlapping subsequences
might be ordered corresponding to a sequence of duplicate operations. Moreover, we must
distinguish between generators that are comprised of the same set of substrings of X but
that are ordered differently.
                           (k)
Intuitively, the value ηX (Y ) represents the number of length-k feasible generators for Y ,
            (k)
the value ηXi (Y, i) represents the number of length-k f feasible generators for Y such
                                                 (k)
that xi generates y1 , and the value ηX (Y, i, d) represents the number of length-k feasible
generators for Y such that xi generates y1 and this character xi appears in a substring of X
that is dth in the order of elements in the generator.

The recurrence given in Lemma 13 differs from that given in Lemma 6 in the inclusion
                     (k)
of the function ηX (Y, i, d). Fundamentally, the two additive terms in the definition of
 (k)
ηX (Y, i, d) correspond to two cases that are analogous to the two cases originally described
in the presentation of the duplication distance algorithm (see Thm. 1). In the first case, the
substring corresponding to the character xi generates the character at y1 in a duplicate
operation in which just a single character is copied, corresponding to an element of the
generator. In this case, the remaining suffix Y2,|Y | is generated in another k − 1 duplicate
operations. For every length-(k − 1) feasible generator for the suffix Y2,|Y | , we can insert
                                                                                              94

the substring x1 into the dth position in the ordering to yield a length-k feasible generator
for Y in which the substring xi generates y1 and is ordered dth in the generator. In the
second case, the character at y1 is generated in a duplicate operation along with another
character at yj (for some j > 1). In this case, the suffix of Y can be broken into two
independent subproblems: the substring Y2,j−1 and the suffix Yj,|Y | . A length-l feasible
generator for Y2,j−1 and a length-(k − l) feasible generator for Yj,|Y | can be combined to
yield a length-k feasible generator for Y . Moreover, if the generator for Yj,|Y | contains
a substring that begins at index i + 1 in X to generate yj and that substring is ordered
dth within that generator, then the combined length-k generator will have the necessary
properties; in particular, the generator will include some substring that begins at xi (and
also includes xi+1 and possibly successive characters as well) that generates the characters
y1 (and yj and possibly successive characters as well) and that substring will be ordered dth
in the generator.
                                                                (k)
Now, the last multiplicative term in the definition of ηX (Y, i, d) accounts for the number
of ways that we can construct a total order on k items that are partitioned into sets of l
and k − l items that are themselves, respectively, ordered. Note that the substring that
generates the characters at y1 and yj must appear dth in the combined total order and the
substrings that comprise the length-l feasible generator for Y2,j−1 (and therefore generate
subsequences of Y that are inside the subsequence containing y1 and yj ) must come after
the dth position in the ordering of all k elements. For an integer 0 ≤ s ≤ l − 1, a sequence
of l items can be split at s positions in l−1
                                               
                                             s
                                                 different ways, yielding s + 1 subsequences.
There are k − l − d + 1 positions that come after the dth position in between successive
elements in the sequence of k − l items comprising the feasible generator for Yj,|Y | . There
are k−l−d+1   ways of placing s + 1 subsequences into these k − l − d + 1 position to yield
            
      s+1
a totally ordered sequence of k items.
                                                          (k)
We can compute the restricted partition function QX (Y ) efficiently by first counting the
                                                    (k)
number of relevant feasible generators, namely ηX (Y ), and scoring each generator appro-
priately by σ(k, | Y |). This gives us the following theorem.
Theorem 11. Let X = x1 . . . x|X| , Y = y1 , . . . , y|Y | be a source/target string pair and let
                                                                      (k)
k be a positive integer. The restricted partition function QX (Y ) satisfies the following.
                                 (k)         (k)
                               QX (Y ) = ηX (Y ) · σ(k, | Y |).

                                                                            (k)
To compute the recurrence in Lemma 13, we must compute the value ηX (Y, i, d) for every
substring of Y , every value in {i | xi = y1 }, and every value d = 1, . . . , k; in total
                                                                                         95

 (k)
ηX (Y, i, d) must be computed O(| Y |2 ·µ(X) · k) times, where µ(X) is the maximum
                                                           (k)
multiplicity of any character in X. Each computation of ηX (Y, i, d) takes then O(µ(Y )k)
time. Thus, the recurrence in Lemma 13 can be computed in time O(| Y |2 µ(X)µ(Y )k 2 );
                       (k)
the time to compute QX (Y ) is the same. In the worst case, this is O(| Y |5 · | X |).
       APPENDIX C: A DISCUSSION OF
 THE BLOCK ORDERING PROBLEM
       USING A BREAKPOINT GRAPH
                                 FRAMEWORK

The Block Ordering Problem was originally introduced by Gaul and Blanchette in [28].
The authors present a solution to the problem of completing a pair of partially ordered
genomes so as to maximize the number of cycles in the resulting breakpoint graph. Recall
from Section 4.3 that maximizing the number of cycles in the breakpoint graph for a pair
of genomes is equivalent to maximizing the number of cycles in their adjacency graph.
Therefore, the algorithm presented in Section 4.5.2 and the algorithm presented in [28]
both produce a pair of complete genomes that satisfy the same optimality criterion.1

The solution in [28] begins with the construction of a fragmented breakpoint graph, a gen-
eralization of the breakpoint graph for a pair of genomes (see Section 4.3 for a description
of a breakpoint graph). The nodes in a fragmented breakpoint graph for a pair of partial
genomes on N genes correspond to the 2N extremities, and as in the breakpoint graph, the
bi-colored edges between nodes correspond to adjacencies in either of the two genomes
with each color corresponding to adjacencies in one of the two genomes. Obverse edges
are omitted. However, because a partially assembled genome may exhibit fewer than N
adjacencies, each extremity in the fragmented breakpoint graph may exhibit fewer than
two neighbors. Thus, a fragmented breakpoint graph is comprised of a collection of color-
alternating simple cycles and color-alternating simple paths. The process of completing the
   1
    Note that the block ordering problem, as presented in [28], requires that the completed genomes be linear-
unichromosomal. Although the algorithm in Section 4.5.2 constructs completed genomes that are circular-
unichromosomal, the algorithm can be adapted easily to construct linear genomes instead by including a pair
of odd-length paths in the final adjacency graph.

                                                     96
                                                                                          97

genomes optimally will result in the transformation of the collection of simple paths in the
fragmented breakpoint graph into a set of simple cycles whose cardinality is maximum.

In order to complete the genomes optimally in [28], the authors construct a block ordering
graph from the fragmented breakpoint graph. In this graph, the only vertices represented
correspond to those with degree zero or one in the fragmented breakpoint graph (i.e. those
for which adjacencies must be ascribed in order to complete the pair of genomes). Edges in
the block ordering graph are either dashed or solid. Dashed edges connect pairs of vertices
that appear at opposite ends of a simple path in the fragmented breakpoint graph, and solid
edges connect pairs of vertices that appear at opposite ends of a block of adjacencies in one
partial genomes.

The authors distinguish between different types of components in the block ordering graph
and prescribe a method for “processing” each type of component. In particular, they iden-
tify components comprised entirely of solid edges corresponding to blocks from a sin-
gle partial genome and dashed edges corresponding to paths in the fragmented breakpoint
graph whose starting and ending vertices appear at the ends of blocks from the same par-
tial genome, the so-called one-sided components. They note that the dashed edges can be
“processed” by joining together the two endpoint vertices, creating a new adjacency in the
genome and effectively merging together two blocks into one larger contig. Unfortunately,
they note that “not all [such] edges can simultaneously be ‘closed’ that way, because this
may lead to an invalid solution: since each [such] component is an alternating cycle in the
[block ordering graph], closing all its dashed edges would correspond to joining all the
corresponding block ends, ultimately resulting [in] a cycle of blocks.” A cycle of blocks
in a genome would correspond to a circular contig that cannot be merged with any other
blocks by adding new adjacencies which is invalid. They then note that “the good news is
that any [such dashed] edge can be sacrificed and the resulting partial ordering of blocks
can be inserted anywhere in the complete orderings, without changing the score of the so-
lution...” This relies on showing that for any pair of partial genomes whose block ordering
graph exhibits a set of ω one-sided components with a total of lα dashed edges among
them, there cannot exist a linear-unichromosomal completion of the genomes that exhibit
more than lα − ω new cycles derived from the processing of the dashed edges in one-sided
components. However, this is not explicitly shown in [28].

In particular, if we suppose that a block ordering graph is comprised of ω one-sided com-
ponents with a total of lα dashed edges among them, then there must exist (by definition)
                                                                                             98

lα alternating paths in the fragmented breakpoint graph. Therefore, a naive upper bound
on the number of new cycles that can be constructed by closing those paths in the frag-
mented breakpoint graph is lα . Instead, the strategy described in [28] produces lα − ω new
cycles in the breakpoint graph by processing dashed edges. In the supplement to [28], the
authors provide, in Lemma 2, that it is possible to construct a solution in which there are
lα − ω new cycles in the fragmented breakpoint graph that result from processing one-sided
dashed edges, but they do not show explicitly that lα − ω is an upper bound.

Note that in proving the optimality of our algorithm for the restricted block ordering prob-
lem in Section 4.5.2, we do prove an analog of the necessary lemma that is not explicitly
shown in [28]. In Lemma 10, we state that the maximum number of cycles that can be
added to a partial adjacency graph for a pair of partial genomes by completing the genomes
                                                                   |M(X )|
is bounded by the number of missing adjacencies divided by two,      2
                                                                           ,   (an analog of the
number of dashed edges in the block ordering graph) minus the number of cycles in the
genome graph defined by the genome obtained by augmenting the partial genome X with
the set of unsatisfied pairs, c(GX ∪U (A,X ) ) (the analog of the number of one-sided compo-
nents in the block ordering graph). By then providing an algorithm that achieves this upper
bound, we complete the proof of optimality stated in Thm. 10.