Analysis of an Interacting Particle Method for Rare Event Estimation

by Yi Cai
M.S., Brown University, Providence, RI, 2007
B.S., Peking University, Beijing, China, 2006
B.A., Peking University, Beijing, China, 2006

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the Division of Applied Mathematics at Brown University

Providence, Rhode Island
May 2012

Abstract of "Analysis of an Interacting Particle Method for Rare Event Estimation" by Yi Cai, Ph.D., Brown University, May 2012

This thesis is a large deviations study of the performance of an interacting particle method for rare event estimation. The analysis is restricted to a one-dimensional setting, though even in this restricted setting a number of new techniques must be developed. In contrast to the large deviations analysis of related algorithms, for interacting particle schemes it is an occupation measure analysis that is relevant, and within this framework many standard assumptions (stationarity, the Feller property) can no longer be assumed. The methods developed are not limited to the question of performance analysis, and in fact give the full large deviations principle for such systems.

© Copyright 2012 by Yi Cai

This dissertation by Yi Cai is accepted in its present form by the Division of Applied Mathematics as satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date    Paul Dupuis, Ph.D., Advisor

Recommended to the Graduate Council

Date    Hongjie Dong, Ph.D., Reader
Date    Kavita Ramanan, Ph.D., Reader

Approved by the Graduate Council

Date    Peter Weber, Dean of the Graduate School

To My Parents

The Vita of Yi Cai

Yi Cai was born in Nanchang, Jiangxi province, China on January 8, 1985. In 2002, she started her undergraduate study at Peking University, and received both a Bachelor of Science degree in Theoretical Mechanics and a Bachelor of Arts degree in Economics four years later.
She came to the United States in 2006 and entered the Ph.D. program in the Division of Applied Mathematics at Brown University. During the Ph.D. program, she received a Master of Science degree in Applied Mathematics in 2007. This dissertation was defended on April 27, 2012.

Acknowledgments

First and foremost, I would like to express my deepest gratitude to my advisor, Prof. Paul Dupuis. This dissertation would not have been possible without his support, understanding and encouragement. Coming from a non-mathematical background, I started out with little knowledge of large deviations and little experience in scientific research. Prof. Dupuis, with unsurpassed patience, helped me overcome those challenges and pointed me in the right direction. His rigorous attitude towards scientific research has been a tremendous influence on me and shaped me profoundly. I am very fortunate to have him as a mentor.

I would also like to thank my other committee members, Prof. Hongjie Dong and Prof. Kavita Ramanan, for their helpful suggestions and comments during the completion of this dissertation. I am grateful to the Division of Applied Mathematics and the Graduate School of Brown University for generous financial support over the past six years. I also want to thank all of my friends at Brown for helping me in every aspect. My special appreciation goes to Jingjing Wang for her friendship and support, which helped me through many crises. Last but not least, I would like to thank my parents for their unconditional love. They have always been supportive of everything I do, even though they are on the other side of the world.

Contents

Acknowledgments

0 Introduction and Preliminaries
  0.1 Monte Carlo Simulation and Asymptotic Optimality
  0.2 Beyond Monte Carlo Simulation
  0.3 Preliminaries

I Problem Formulation and Main Results

1 Notation and Problem Formulation
  1.1 Process Model and the Event of Interest
  1.2 Interacting Particle System (IPS) Estimator
  1.3 Reformulation as an Occupation Measure Problem
  1.4 First Entrance Distributions and the Feller Property
2 Statement of the Main Results

II Large Deviation Analysis

3 Relative Entropy Representation
  3.1 The Representation Formula
  3.2 The Limiting Behavior of the Underlying Transportation Kernel
4 The Large Deviation Upper Bound
  4.1 Notation and Preliminaries
  4.2 Tightness and Characterization of Limits
  4.3 Proof of the Lower Bound (2.2)
5 The Large Deviation Lower Bound
  5.1 Some Important Assumptions and Ergodicity of the Limit Underlying Transition Kernel
  5.2 The Construction of Ergodic IPS Transition Kernels
  5.3 The Construction of Ergodic Controls
  5.4 Proof of the Upper Bound (2.3)
  5.5 Appendix
6 Asymptotics of the Decay Rate as N → ∞
  6.1 Proof of Theorem 6.1

III Numerical Results

7 Computational Results

Bibliography
Index

List of Figures

7.1 large deviation rate of the second moment of the interacting particle system estimator for different m and N
7.2 trade-off between N and K with fixed effort, N × K = 1,000,000, m = 10
7.3 trade-off between N and K with fixed effort, N × K = 1,000,000, m = 50
7.4 trade-off between N and K with fixed effort, N × K = 1,000,000, m = 10
7.5 sample error, m = 10, fixed K
7.6 sample error, m = 50, fixed K
7.7 standard error, m = 10, fixed K
7.8 standard error, m = 50, fixed K
7.9 standard error, m = 10, fixed N
7.10 decay rate in N, m = 10, fixed K
7.11 decay rate in N, m = 50, fixed K

Chapter 0

Introduction and Preliminaries

0.1 Monte Carlo Simulation and Asymptotic Optimality

A basic and common technique for the estimation of probabilities and functionals of probability measures is standard Monte Carlo simulation. Let X be a random variable taking values in R distributed according to µ ∈ P(R). To estimate the expectation of X, denoted by EX, one can generate K independent and identically distributed (iid) samples from µ, say X_0, X_1, ..., X_{K−1}, and the estimate for EX is just the sample mean

    Q_K = \frac{1}{K} \sum_{i=0}^{K-1} X_i.

It is easy to check that this estimator is unbiased. Furthermore, the Law of Large Numbers (LLN) ensures that this estimate converges to the correct value as the number of samples K goes to ∞. The sample size K can be chosen by some predetermined precision requirement.

However, when the events of interest are of small probability (termed rare events), standard Monte Carlo simulation will require an enormous number of samples for the variance of the resulting estimate to be comparable to the unknown probability.
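The sample-mean estimator just described can be sketched in a few lines. This is a minimal illustration only; the N(3, 1) target distribution and the function name `sample_mean` are hypothetical stand-ins for µ and Q_K.

```python
import random

def sample_mean(sampler, K, seed=0):
    """Standard Monte Carlo: unbiased estimate of EX from K iid samples."""
    rng = random.Random(seed)
    return sum(sampler(rng) for _ in range(K)) / K

# Estimate EX for X ~ N(3, 1); the LLN drives the estimate toward 3 as K grows.
est = sample_mean(lambda rng: rng.gauss(3.0, 1.0), K=200_000, seed=1)
print(est)
```

The standard deviation of the sample mean shrinks like 1/√K, which is adequate here but, as discussed next, is far too slow when the quantity being estimated is itself exponentially small.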
To be precise, we introduce the large deviation methodology. We begin by giving the definitions of a rate function and the large deviation principle. Let X be a complete separable metric space, i.e. a Polish space.

Definition 0.1 (Rate Function). A function I on X is called a rate function on X, or simply a rate function, if I maps X into [0, ∞] and if for each M < ∞ the level set {x ∈ X : I(x) ≤ M} is a compact subset of X.

Let {X^m, m ∈ N} be a sequence of random variables defined on a probability space (Ω, F, P) and taking values in a Polish space X.

Definition 0.2 (Large Deviation Principle). Let I be a rate function on X. The sequence {X^m} is said to satisfy the large deviation principle on X with rate function I if the following two conditions hold.

(a) Large deviation upper bound. For each closed subset F of X

    \limsup_{m \to \infty} \frac{1}{m} \log P\{X^m \in F\} \le -\inf_{x \in F} I(x).

(b) Large deviation lower bound. For each open subset G of X

    \liminf_{m \to \infty} \frac{1}{m} \log P\{X^m \in G\} \ge -\inf_{x \in G} I(x).

With the definition of the large deviation principle in hand, we now give a concrete example of the poor performance of standard Monte Carlo for rare event simulation, keeping in mind that the issues hold in general.

Let Y^0, Y^1, ... be a sequence of iid R^d-valued random variables with distribution µ, and assume the moment generating function \int_{R^d} e^{\langle \alpha, y \rangle} µ(dy) is finite for all α ∈ R^d. Let S^m = Y^0 + Y^1 + ··· + Y^{m−1}. For Borel sets A ⊂ R^d, Cramér's Theorem [6, Theorem 2.2.3] guarantees a large deviation principle for S^m/m. Let p^m = P(S^m/m ∈ A). We also define the indicator function 1_A(x), which equals 1 if x ∈ A and 0 otherwise.

Now let us consider the Monte Carlo approximation of p^m for some fixed value of m. In terms of the notation introduced previously, we have X = 1_A(S^m/m), and straightforward Monte Carlo would require the generation of many iid copies of X (i.e., of S^m). The variance of Q_K is given by

    \mathrm{var}[Q_K] = \mathrm{var}\left[ \frac{1}{K} \sum_{i=0}^{K-1} 1_A\!\left(\frac{S_i^m}{m}\right) \right] = \frac{p^m(1-p^m)}{K}.

If p^m → 0 as m → ∞, this quantity approaches zero. However, the relative error is

    \text{relative error} = \frac{\text{standard deviation of } Q_K}{\text{mean of } Q_K} = \frac{\sqrt{p^m(1-p^m)}}{\sqrt{K}\, p^m}.

Since \sqrt{p^m(1-p^m)}/p^m → ∞, a large sample size (i.e., K) is required for the estimator to achieve a reasonable relative error bound (at least when m is large). In fact, since p^m scales according to a large deviation principle, K must grow exponentially in m if a bounded relative error is to be maintained. As a result, standard Monte Carlo simulation is rarely used to estimate rare event probabilities, and more efficient alternatives are explored.

Before introducing other more "efficient" schemes, a natural question to ask is: what is a mathematical criterion for "efficient"? The above discussion suggests an answer: for fixed m, the faster the variance decays, the smaller the required sample size K. We define those schemes that achieve the "fastest" rate of convergence as asymptotically optimal. The formal mathematical definition is formulated as follows.

Asymptotic Optimality

Assume the random variables {X^m, m = 0, 1, ...} satisfy a large deviation limit

    \lim_{m \to \infty} \frac{1}{m} \log E X^m = -\rho

for some ρ ≥ 0. In order to estimate EX^m, an alternative random variable \bar{X}^m is constructed such that E X^m = E \bar{X}^m. Similarly to standard Monte Carlo, one generates iid replications \bar{X}_0^m, ..., \bar{X}_{K-1}^m of \bar{X}^m and obtains an unbiased estimator by averaging:

    \bar{Q}_K^m = \frac{1}{K} \sum_{i=0}^{K-1} \bar{X}_i^m.

The rate of convergence associated with this estimator is determined by the variance of the summands. Since unbiasedness implies that the first moment is fixed, it suffices to consider the second moment E(\bar{X}_0^m)^2. However, it follows from Jensen's inequality that

    \limsup_{m \to \infty} -\frac{1}{m} \log E\big(\bar{X}_0^m\big)^2 \le \lim_{m \to \infty} -\frac{1}{m} \log \big(E \bar{X}_0^m\big)^2 = 2\rho.    (0.1)

The estimator \bar{Q}_K^m is said to be asymptotically optimal if this upper bound is achieved, i.e. if

    \lim_{m \to \infty} -\frac{1}{m} \log E\big(\bar{X}_0^m\big)^2 = 2\rho.

This thesis studies the performance of one of these more efficient schemes for rare event simulation, namely the interacting particle scheme. The above concept will be the focus of the analysis. We will come back to the definition of asymptotic optimality in Chapter 1.

0.2 Beyond Monte Carlo Simulation

The two most widely considered alternatives to standard Monte Carlo simulation are those based on change-of-measure techniques and those based on branching processes. The schemes this dissertation is interested in belong to the second category.

Importance Sampling Schemes

The first alternative is usually referred to as importance sampling. Suppose that the probability of interest takes the form p = P(X ∈ A) = µ(A), where A is a subset of some reasonably regular space and µ a probability measure. In contrast to ordinary Monte Carlo, which samples directly from µ, importance sampling uses an alternative sampling distribution ν, generates iid samples \bar{X}_i from ν, and then estimates via the sample mean of [dµ/dν](\bar{X}_i) 1_{\{\bar{X}_i \in A\}}. The Radon-Nikodym derivative [dµ/dν](\bar{X}_i) guarantees that the estimate is unbiased. The goal is to choose ν so that the individual samples [dµ/dν](\bar{X}_i) 1_{\{\bar{X}_i \in A\}} cluster tightly around p, thereby reducing the variance. The characterization of the optimal achievable performance of importance sampling schemes in the large deviation limit is discussed in [10, 11] in terms of a deterministic differential game, and [12] suggests a way to construct importance sampling schemes based on subsolutions of the Isaacs equation.

Branching Type Schemes

For branching type schemes, the most discussed application is first entrance probabilities, and to continue the discussion we specialize to that case. Consider a Markov process X = {X_i, i = 0, 1, ...}, and the probability of interest

    p = P\{\tau_A \le \tau_{A \cup B} \,|\, X(0) = x\}

for sets A and B, where \tau_C = \min\{i \ge 0 : X_i \in C\}.
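As a toy instance of this first-entrance probability, the following sketch estimates p by crude Monte Carlo for a hypothetical one-dimensional random walk with negative drift standing in for X, with A = [1, ∞) the rare set and B = (−∞, 0] the typical one. The drift, step size, and function name are all illustrative assumptions, not part of the thesis's model.

```python
import random

def entrance_prob_mc(K, drift=-0.2, step=0.1, seed=0):
    """Crude Monte Carlo for p = P{tau_A <= tau_{A u B}}: count paths of a
    random walk started at 0.5 that enter A = [1, inf) before B = (-inf, 0].
    The walk is an Euler-type discretization with negative drift."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(K):
        x = 0.5
        while 0.0 < x < 1.0:
            x += drift * step + rng.gauss(0.0, 1.0) * step ** 0.5
        if x >= 1.0:
            hits += 1
    return hits / K

print(entrance_prob_mc(K=20_000))
```

With the mild drift chosen here the event is not yet rare, so plain Monte Carlo works; as the drift strengthens or the gap widens, the hit count collapses and the relative-error problem of the previous section reappears.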
We consider just the discrete time case, but note that the continuous time case is also of interest. Suppose that B is a "typical set" and that A is rare, in that with high probability (perhaps very high probability) B is entered prior to A.

When using branching processes, the idea is to branch the process in such a way that the a priori rare outcome becomes more likely for at least some of the descendants (or particles), and then normalize to produce a single unbiased estimate of p. As with ordinary Monte Carlo one can then generate a number of independent copies of the branched process, and use the sample average to estimate p. Since the copies are independent, the performance of the scheme is determined by the variance of a single copy. Key questions in the construction of the branching process are then when to branch (the triggers) and how much to branch (the rates). A natural and standard approach to defining triggers is the use of thresholds, i.e., nested sets {C_j, j = 0, 1, ..., m} with C_0 ⊃ C_1 ⊃ ··· ⊃ C_m ⊃ A, and a rule for branching when these sets are entered. For example, branching of a particular particle could occur the first time either it, or perhaps some selected collection of particles containing it, first reaches a threshold set.

Within a framework where branching is triggered by thresholds there are many possible variants. In what appears to be the oldest class of such methods, particles do not interact, and branching of any particular particle occurs the first time it reaches a threshold that has not been reached before. The number of descendants could be random and could depend on other factors such as the position of the particle when entering the threshold, but it is independent of the state of all other particles at the time of the branching. (If more than one new threshold is entered at the same time, then the distribution of the number of descendants is an appropriate convolution.)
The mean number of descendants is often a constant (an assumption which for purposes of discussion we will assume true), and hence these types of schemes are sometimes called Fixed Rate Branching schemes. Besides the basic fixed branching rate scheme [16, 17, 19], more efficient variants which kill off particles that are of little use while remaining unbiased have been developed [3, 18, 22, 23, 24].

There are many challenges in the design of such schemes. For a large scale problem care must be taken to ensure that the total number of particles, whose distribution obviously depends on the precise placement of the thresholds, is not too large. At the same time, significant reduction of the variance of the estimator requires that enough particles make it all the way to the rare set. The challenge of the design problem is to find the proper balance between these opposing requirements.

In a second class of methods, a particle can be branched only when all particles that started from the previous threshold (say C_{j−1}) have either reached C_j or B. The branching here will not be conditionally independent of the state of all other particles, and in fact the branching is done so as to maintain a fixed and deterministic number (say N) of particles at each threshold. To be precise, suppose that {X_{j−1,n}, n = 1, ..., N} denote the locations of N particles in threshold C_{j−1}. Starting with these locations, particles are independently evolved under the dynamics of X until either C_j or B is reached. Suppose that N_j ≤ N particles reach C_j before B, and let S_j denote the collection of corresponding locations. In order to maintain N particles, we let X_{j,n} for n = 1, ..., N_j be chosen to coincide with the N_j positions in S_j (the order is unimportant), and then construct X_{j,n} for n = N_j + 1, ..., N by independently sampling from S_j according to the uniform distribution. [If N_j = 0 the simulation terminates and zero is returned as the estimate.]
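A single stage of this transport-and-resample mechanism can be sketched as follows. The Brownian-like dynamics with negative drift, the specific levels, and the function name are purely illustrative assumptions; the logic shown is the fixed-effort rule just described: evolve to the next threshold or the absorbing set, keep the survivors, refill to N by uniform resampling.

```python
import random

def one_stage(starts, next_level, B_level, drift=-0.2, step=0.1, rng=None):
    """One fixed-effort stage: evolve each particle independently until it
    enters C_j = [next_level, inf) or the absorbing set B = (-inf, B_level].
    Survivors S_j are kept; the population is refilled to N = len(starts)
    by uniform resampling from S_j.  Returns (new positions X_j, count N_j)."""
    rng = rng or random.Random(0)
    survivors = []
    for x in starts:
        while B_level < x < next_level:
            x += drift * step + rng.gauss(0.0, 1.0) * step ** 0.5
        if x >= next_level:
            survivors.append(x)
    n_j = len(survivors)
    if n_j == 0:
        return [], 0  # population died out: the estimate is zero
    refilled = survivors + [rng.choice(survivors) for _ in range(len(starts) - n_j)]
    return refilled, n_j

positions, n1 = one_stage([0.0] * 100, next_level=0.25, B_level=-1.0)
print(n1, len(positions))
```

Note that every returned position lies in the new threshold set, so the next stage can start from these locations directly.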
Repeating this process over say m thresholds which separate the initial point from A, we get an unbiased estimate

    \hat{p} = \prod_{j=1}^{m} \frac{N_j}{N}    (0.2)

of P\{\tau_A \le \tau_{A \cup B}\}. Methods of this sort are studied in [4, 5, 14, 15], and since the total computational effort (or at least the total number of particles) is a priori known, they are referred to as Fixed Effort Branching methods. Other names used for these sorts of schemes are interacting particle systems (IPS) and genealogical particle methods.

Large deviation analyses of both standard splitting and the variation known as RESTART have been carried out in [2] and [3]. Under the large deviation scaling (and for light-tailed processes), it is natural to let the thresholds be defined as level sets of a function V that is independent of m, and then let the "gaps" between levels scale proportionally to 1/m. Associated to the estimation problem is a partial differential equation [PDE] (specifically, a first order Hamilton-Jacobi-Bellman equation) defined in terms of the sets A and B and the large deviation rate function of X^m. The first conclusion of [2] and [3] is that the total computational effort associated with generating a single sample scales subexponentially with m if and only if V/\log R is a subsolution to this PDE, where R is the mean number of particles produced at each branching. It is well known that p^m decays like e^{-mW(x)}, where W is a weak-sense solution to the same PDE. The second conclusion is that the second moment of the estimator (defined via the thresholds of V and with mean branching rate R) decays like e^{-m[W(x)+V(x)/\log R]}, where x is the starting point X^m(0) = x. Hence, in this rough exponential sense, the standard deviation of the estimator is of the same size as the probability of interest if and only if V(x)/\log R achieves the maximum possible value at x [i.e., W(x)].
When this is not the case, the ratio of the standard deviation to the probability grows exponentially as e^{m[W(x)-V(x)/\log R]/2}. Hence significant improvement over ordinary Monte Carlo [i.e., V(x) = 0] is still possible, and the improvement is quantified by V(x).

In contrast, very little is known regarding the performance of fixed effort schemes in the small probability limit. Analyses to date have studied the variance when the number of particles is first sent to ∞ (so that exchangeability and asymptotically negligible interaction between particles can be exploited), and then the probability is made small. Since fixed effort schemes are at least in part motivated by a desire to control computational effort in a predictable way, this is not an ideal measure of performance. One would prefer to keep the number of particles fixed while letting the probability tend to zero and, as with the fixed rate schemes, compare the standard deviation of the estimator with the probability. It is not surprising that such an analysis is difficult. With standard splitting, any two particles that reach A have a fairly simple dependence. At some time in the past these particles have a last common ancestor. Prior to that the particles have the same history, and after that the evolutions of the two are independent. One can partition particles according to the time and location of the last common ancestor, and in this way exploit this relatively simple dependence. More complex implementations are more difficult to analyze, but the dependence between particles is still limited.

The situation with fixed effort schemes is very different. Here dependence between particles is reintroduced with each resampling, which makes the analysis much more difficult. In the non-interacting case, one can analyze the variance of the estimator using large deviation estimates and techniques related to Freidlin-Wentzell type "small noise" asymptotics.
In the interacting case the relevant techniques are related to those used to analyze the empirical or occupation measure of a Markov process, though there are numerous difficulties that make the analysis significantly more technical and complex. Among the difficult features, besides the recoupling mentioned previously, are: (i) the natural time scale for the process is not the natural time scale for the algorithm, which is in fact indexed by thresholds; (ii) the transition kernels associated with threshold crossing typically do not satisfy the Feller property; (iii) the transition kernels also depend on the large deviation parameter m; and (iv) the derived estimates should be valid all the way from a small number of particles (where large deviation effects dominate) to many particles (where law of large numbers effects dominate). In this thesis we consider only the one dimensional problem, which is already somewhat demanding, and leave the multidimensional problem for further study.

An outline of the thesis is as follows. In the next chapter we formulate the problem and the algorithm in precise terms, and state most of the assumptions that will be used. We also discuss the failure of the Feller property mentioned previously, and indicate how this issue will be addressed in the analysis. Chapter 2 gives the statement of the main result, which is the large deviation decay rate for the second moment of the IPS estimator for an arbitrary number of particles. Chapter 3 develops a convenient stochastic control representation for this second moment. Chapter 4 proves tightness of measures and processes appearing in the representation, characterizes weak limits, and proves the large deviation upper bound. Ergodicity properties for controlled processes are discussed in Chapter 5, after which the large deviation lower bound is proved. Chapter 6 studies the decay rate as a function of the number of particles.
Among the conclusions of this section are that for any fixed N the IPS method has a suboptimal rate of convergence of the second moment to zero, with a well defined "gap." However, another conclusion is that the IPS method is also less sensitive to design issues than competing schemes, at least in the one dimensional setting. The issue is less clear cut for multidimensional problems.

0.3 Preliminaries

In this section, we state some useful concepts and theorems which will be needed throughout this thesis. We begin by giving the definition of the Laplace principle, which can be seen as an alternative definition of the large deviation principle (Definition 0.2). See [9, Section 1.2] for a detailed discussion. In this dissertation, we will use this definition to prove the large deviation property of the second moment of the IPS estimator.

Definition 0.3 (Laplace Principle). Let I be a rate function on X. The sequence {X^m} is said to satisfy the Laplace principle on X with rate function I if for all bounded continuous functions h

    \lim_{m \to \infty} \frac{1}{m} \log E\{\exp[-m h(X^m)]\} = -\inf_{x \in \mathcal{X}} \{h(x) + I(x)\}.

The term Laplace principle upper bound refers to the validity of

    \limsup_{m \to \infty} \frac{1}{m} \log E\{\exp[-m h(X^m)]\} \le -\inf_{x \in \mathcal{X}} \{h(x) + I(x)\}

for all bounded continuous functions h, while the term Laplace principle lower bound refers to the validity of

    \liminf_{m \to \infty} \frac{1}{m} \log E\{\exp[-m h(X^m)]\} \ge -\inf_{x \in \mathcal{X}} \{h(x) + I(x)\}

for all bounded continuous functions h.

Another important concept is relative entropy, which will play a key role in the weak convergence approach to large deviations via the variational formula given in part (a) of Proposition 0.5. The derivation of this formula requires a measurable space (S, A). We denote by P(S) the set of probability measures on (S, A).

Definition 0.4 (Relative Entropy). Given γ, θ ∈ P(S), the relative entropy of γ with respect to θ is defined by

    R(\gamma \| \theta) = \int_S \log\left[\frac{d\gamma}{d\theta}(x)\right] \gamma(dx)

if γ is absolutely continuous with respect to θ, i.e. γ ≪ θ, and \log([d\gamma/d\theta](x)) is integrable with respect to γ, and by ∞ otherwise.

The next proposition [9, Proposition 1.4.2], which states the variational formula, is a cornerstone of the weak convergence approach.

Proposition 0.5. Let (S, A) be a measurable space, k a bounded measurable function mapping S into R, and θ a probability measure on S. The following conclusions hold.

(a) We have the variational formula

    -\log \int_S e^{-k} \, d\theta = \inf_{\gamma \in \mathcal{P}(S)} \left[ R(\gamma \| \theta) + \int_S k \, d\gamma \right].    (0.3)

(b) Let γ⁰ denote the probability measure on S which is absolutely continuous with respect to θ and satisfies

    \frac{d\gamma^0}{d\theta}(x) = e^{-k(x)} \cdot \frac{1}{\int_S e^{-k} \, d\theta}.

Then the infimum in the variational formula (0.3) is attained uniquely at γ⁰.

An extended version of Proposition 0.5 will also be needed, given in [9, Proposition 4.5.1].

Proposition 0.6. Let S be a Polish space, k a measurable function mapping S into R which is either bounded from below or bounded from above, and θ a probability measure on S. Then we have the variational formula

    -\log \int_S e^{-k} \, d\theta = \inf_{\gamma \in \Delta(S)} \left[ R(\gamma \| \theta) + \int_S k \, d\gamma \right],

where \Delta(S) = \{\gamma \in \mathcal{P}(S) : R(\gamma \| \theta) < \infty\}.

The next lemma and theorem [9, Lemma 1.4.3, Theorem B.2.1] establish some properties of relative entropy: convexity, lower semicontinuity, compactness of level sets, uniform integrability of sequences of measures satisfying a suitable relative entropy bound, factorization, approximation by sums, the chain rule, etc.

Let X be a Polish space. We denote by C_b(X) the space of bounded continuous functions mapping X into R and by Ψ_b(X) the space of bounded measurable functions mapping X into R.

Lemma 0.7. Let X and Y be Polish spaces. The relative entropy R(·‖·) has the following properties.

(a) (Donsker-Varadhan variational formula.) For each γ and θ in P(X)

    R(\gamma \| \theta) = \sup_{g \in C_b(\mathcal{X})} \left[ \int_{\mathcal{X}} g \, d\gamma - \log \int_{\mathcal{X}} e^{g} \, d\theta \right]
                        = \sup_{\psi \in \Psi_b(\mathcal{X})} \left[ \int_{\mathcal{X}} \psi \, d\gamma - \log \int_{\mathcal{X}} e^{\psi} \, d\theta \right].    (0.4)

(b) R(γ‖θ) is a convex, lower semicontinuous function of (γ, θ) ∈ P(X) × P(X).
In particular, R(γ‖θ) is a convex, lower semicontinuous function of each variable γ or θ separately. In addition, for fixed θ ∈ P(X), R(·‖θ) is strictly convex on the set {γ ∈ P(X) : R(γ‖θ) < ∞}.

(c) For each θ ∈ P(X), R(·‖θ) has compact level sets. That is, for each M < ∞ the set {γ ∈ P(X) : R(γ‖θ) ≤ M} is a compact subset of P(X).

(d) Let X = R^d and let {γ_n, n ∈ N} and {θ_n, n ∈ N} be sequences in P(R^d). Assume that for each α ∈ R^d

    \sup_{n \in \mathbb{N}} \int_{\mathbb{R}^d} e^{\langle \alpha, y \rangle} \, \theta_n(dy) < \infty \quad \text{and} \quad \sup_{n \in \mathbb{N}} R(\gamma_n \| \theta_n) < \infty.

Then {γ_n} is both tight and uniformly integrable in the sense that

    \lim_{C \to \infty} \sup_{n \in \mathbb{N}} \int_{\{y \in \mathbb{R}^d : \|y\| > C\}} \|y\| \, \gamma_n(dy) = 0.

(e) Let X = R^d and let γ and θ be probability measures on R^d. Assume that for each α ∈ R^d

    \int_{\mathbb{R}^d} e^{\langle \alpha, y \rangle} \, \theta(dy) < \infty

and that R(γ‖θ) < ∞. Then \int_{\mathbb{R}^d} \|y\| \, \gamma(dy) < \infty.

(f) Let σ(dy|x) and τ(dy|x) be stochastic kernels on Y given X and θ a probability measure on X. Then the function mapping x ∈ X ↦ R(σ(·|x)‖τ(·|x)) is measurable and

    \int_{\mathcal{X}} R(\sigma(\cdot|x) \| \tau(\cdot|x)) \, \theta(dx) = R(\theta \otimes \sigma \| \theta \otimes \tau).

(g) We denote by Π the class of all finite measurable partitions of X. Then for each γ and θ in P(X)

    R(\gamma \| \theta) = \sup_{\pi \in \Pi} \sum_{A \in \pi} \gamma(A) \log \frac{\gamma(A)}{\theta(A)},

where the summand equals 0 if γ(A) = 0 and equals ∞ if γ(A) > 0 and θ(A) = 0. In addition, if A is any Borel subset of X, then

    R(\gamma \| \theta) \ge \gamma(A) \log \frac{\gamma(A)}{\theta(A)} - 1.

Let X and Y be Polish spaces and let τ(dy|x) be a family of probability measures on Y parameterized by x ∈ X. We call τ(dy|x) a stochastic kernel on Y given X if for every Borel subset E of Y the function mapping x ∈ X ↦ τ(E|x) ∈ [0, 1] is measurable.

Theorem 0.8 (Chain Rule). Let X and Y be Polish spaces and α and β probability measures on X × Y. We denote by α_1 and β_1 the first marginals of α and β and by α(dy|x) and β(dy|x) the stochastic kernels on Y given X for which we have the decompositions

    \alpha(dx \times dy) = \alpha_1(dx) \otimes \alpha(dy|x) \quad \text{and} \quad \beta(dx \times dy) = \beta_1(dx) \otimes \beta(dy|x).
Then the function mapping x ∈ X ↦ R(α(·|x)‖β(·|x)) is measurable and

    R(\alpha \| \beta) = R(\alpha_1 \| \beta_1) + \int_{\mathcal{X}} R(\alpha(\cdot|x) \| \beta(\cdot|x)) \, \alpha_1(dx).

The next theorem [9, Theorem A.5.6] states the decomposition of stochastic kernels.

Theorem 0.9 (Decomposition Theorem). Let σ(dx × dy|z) be a stochastic kernel on X × Y given S and θ a probability measure on (S, A). We denote by σ_1(dx|z) the first marginal of σ(dx × dy|z), which is the stochastic kernel on X given S defined by σ_1(A|z) = σ(A × Y|z) for Borel sets A and z ∈ S. Then there exists a stochastic kernel σ(dy|x, z) on Y given X × S such that θ-a.s. for z ∈ S

    \sigma(A \times B|z) = \int_{A \times B} \sigma_1(dx|z) \, \sigma(dy|x, z) = \int_A \sigma(B|x, z) \, \sigma_1(dx|z)

for all Borel subsets A of X and B of Y. This decomposition is summarized as σ(dx × dy|z) = σ_1(dx|z) ⊗ σ(dy|x, z).

Let p(x, dy) be a transition probability function defined on a Polish space S. The Feller property ([9, Condition 8.3.1]) is defined as follows.

Definition 0.10 (Feller Property). The function mapping x ∈ S ↦ p(x, ·) ∈ P(S) is continuous in the topology of weak convergence on P(S); i.e., if {x_n, n ∈ N} is any sequence in S such that x_n → x ∈ S, then p(x_n, ·) ⇒ p(x, ·).

Part I

Problem Formulation and Main Results

Chapter 1

Notation and Problem Formulation

In this chapter we formulate the problem of interest, which is the analysis via large deviations of the performance of an interacting particle system (IPS) for rare event estimation. We begin by identifying the event of interest and the related large deviation asymptotics in the next section. In the following section the estimator is constructed and standard performance criteria for rare event estimation are reviewed. In the last section we show how these criteria are reformulated in terms of the occupation measure of a related stochastic process.
1.1 Process Model and the Event of Interest

In this thesis, we restrict discussion to a one dimensional process, but within that framework allow a state dependent discrete time Markov model. The process model is constructed as follows. Let {θ_i(x), i ∈ N_0, x ∈ R} be a sequence of independent and identically distributed (iid) random fields with the common distribution

    P\{\theta_i(x) \in dy\} = \mu(dy|x).    (1.1)

For m ∈ N, define a Markov chain by setting Z_0^m = 0 and recursively defining

    Z_{i+1}^m = Z_i^m + \frac{1}{m} \theta_i(Z_i^m).    (1.2)

Here m is the large deviation parameter, and we are interested in events that become rare as m → ∞. We also consider a continuous time interpolant, which is defined by Z^m(t) = Z_i^m if t ∈ [i/m, i/m + 1/m). We consider Z^m(·) to take values in the standard Skorokhod space. Conditions 1.1 and 1.3 introduced below will be assumed throughout the paper. Condition 1.2 will be replaced by the strictly stronger Condition 1.4 when discussing the performance of the IPS scheme.

Condition 1.1 (Stability). \sup_{x \in \mathbb{R}} E\theta_i(x) < 0.

The event of interest is then

    p^m = P\{Z_{\sigma^m}^m \ge 1 \,|\, Z_0^m = 0\},    (1.3)

where \sigma^m = \min\{i \ge 0 : Z_i^m \in [1, \infty) \cup (-\infty, -1/m]\}. Although the methods we develop below can be used to analyze a variety of problem formulations, the event studied here is common in the rare event literature for the analysis of algorithms.

To identify the rate of decay via large deviations, we need some standard notation and conditions. For the event of interest it is only the definition of µ(dy|x) on [−1/m, 1] that matters, and so we pose conditions henceforth only on x ∈ [−1, 1]. For each such x and α ∈ R, let

    H(x, \alpha) = \log \int_{\mathbb{R}} e^{\alpha y} \mu(dy|x).    (1.4)

For a Polish space S, let P(S) denote the set of probability measures on S, equipped with the topology of weak convergence.

Condition 1.2. For each α ∈ R, \sup_{x \in [-1,1]} H(x, \alpha) < \infty, and the mapping x ∈ R ↦ µ(·|x) ∈ P(R) is continuous in the topology of weak convergence on P(R).
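As a concrete instance of the model (1.2), the following sketch simulates Z^m until it exits via [1, ∞) or (−∞, −1/m]. The state-independent Gaussian increment law µ(dy|x) = N(−0.5, 1) is a hypothetical choice made only for illustration; it satisfies Condition 1.1 since Eθ_i(x) = −0.5 < 0.

```python
import random

def run_chain(m, seed=0):
    """Simulate Z^m from (1.2) until it exits via [1, inf) or (-inf, -1/m].
    Hypothetical increment law mu(dy|x) = N(-0.5, 1), so sup_x E theta_i(x) < 0
    and Condition 1.1 (stability) holds.  Returns (exit value, number of steps)."""
    rng = random.Random(seed)
    z, i = 0.0, 0
    while -1.0 / m < z < 1.0:
        z += rng.gauss(-0.5, 1.0) / m
        i += 1
    return z, i

z_exit, steps = run_chain(m=50, seed=7)
print(z_exit, steps)  # with negative drift the chain typically exits below
```

Since the absorbing set (−∞, −1/m] lies only 1/m below the starting point while the rare set lies a full unit above, a typical run exits downward after very few steps, which is exactly why {Z^m reaches 1} becomes rare as m grows.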
The following properties are well known under the last condition [9]. Let $L(x, \cdot)$ be the Legendre transform of $H(x, \cdot)$, i.e., for $\beta \in \mathbb{R}$
\[ L(x, \beta) = \sup_{\alpha \in \mathbb{R}} \left[ \alpha\beta - H(x, \alpha) \right]. \]
Then $L(x, \beta) \ge 0$, and $L(x, \beta) = 0$ if and only if $\beta = E\theta_i(x)$. Condition 1.2 is sufficient for $Z^m$ to satisfy a large deviation upper bound on path space. Various sets of conditions suffice for the corresponding lower bound, and the particular condition that is assumed below is sufficient for the lower bound and the existence of a large deviation limit for $p^m$. See [9] for alternatives.

Condition 1.3. The interior of the convex hull of the support of $\mu(\cdot|x)$ is independent of $x$ and contains 0.

Under all three conditions we can assert that
\[ \lim_{m \to \infty} \frac{1}{m} \log p^m = \lim_{m \to \infty} \frac{1}{m} \log P\{Z_{\sigma^m}^m \ge 1 \,|\, Z_0^m = 0\} = -\rho, \tag{1.5} \]
where
\[ \rho = \inf\left\{ \int_0^\tau L(\phi(s), \dot{\phi}(s))\,ds : \phi(0) = 0,\ \phi(\tau) = 1,\ \phi(s) \in (0, 1) \text{ for } s \in (0, \tau) \right\} \in (0, \infty), \]
and where the integral is taken to be $\infty$ if $\phi$ is not absolutely continuous.

1.2 Interacting Particle System (IPS) Estimator

While the large deviation asymptotics described in the last section give an estimate of the probability for any $m$, this estimate can be off significantly for values of $m$ that are of interest. For this reason there is much interest in improved approximations, and in particular in higher accuracy Monte Carlo approximations.

The construction of an IPS estimator for $p^m$ can be described as follows. We introduce a sequence of sets (often defined in higher dimensions via level sets) of the form
\[ C_m^m \subset C_{m-1}^m \subset \cdots \subset C_0^m. \]
In the present one-dimensional setting we will use $C_j^m = \{x : x \ge j/m\}$. (An interesting feature of the IPS scheme that will be proved is that, in the one-dimensional setting, it is relatively insensitive to this choice.) The construction of the estimate is recursive and indexed by stages, which are identified with the entrance into a corresponding set (we will use the terms stage and threshold interchangeably).
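Before detailing the stages, note that for a simple increment law the quantities $H$, $L$ and $\rho$ from (1.4) and (1.5) can be evaluated numerically. The sketch below is ours and assumes a hypothetical Gaussian law $\mu = N(\text{mean}, \text{var})$, for which $H(\alpha) = \text{mean}\,\alpha + \tfrac{1}{2}\text{var}\,\alpha^2$ and $L(\beta) = (\beta - \text{mean})^2/(2\,\text{var})$; since $L$ here does not depend on the state, it suffices for this illustration to minimize over constant-speed paths $\phi(s) = s/\tau$.

```python
ALPHAS = [i / 1000.0 for i in range(-2000, 2001)]   # grid for the sup over alpha

def H(alpha, mean=-0.1, var=1.0):
    # log moment generating function of the hypothetical increment law N(mean, var)
    return mean * alpha + 0.5 * var * alpha * alpha

def legendre(beta, mean=-0.1, var=1.0):
    # L(beta) = sup_alpha [alpha*beta - H(alpha)]; the grid search recovers the
    # Gaussian closed form (beta - mean)^2 / (2 var)
    return max(a * beta - H(a, mean, var) for a in ALPHAS)

def rho(mean=-0.1, var=1.0):
    # variational problem for the decay rate, restricted to constant-speed paths
    # phi(s) = s/tau, for which the cost is tau * L(1/tau)
    return min(t * legendre(1.0 / t, mean, var) for t in [i / 10.0 for i in range(1, 301)])
```

For mean $= -0.1$ and var $= 1$ the minimum is attained at $\tau = 10$ and gives $\rho = 0.2$.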
The construction at each stage is partitioned into transportation and resampling aspects. For particles $n = 1, \ldots, N$ and thresholds $j = 0, \ldots, m-1$, define stopping times $\sigma_{j,n}^m$ and stopped processes recursively as follows. Let $\{\theta_{j,n,i}(x), x \in [-1, 1]\}$ be random fields, which are independent and identically distributed for distinct values of $j \in \{0, \ldots, m-1\}$, $n \in \{1, \ldots, N\}$ and $i \in \mathbb{N}_0$.

Stage $j = 0$. Simulate $N$ copies of (1.2), i.e., define
\[ Z_{0,n,i+1}^m = Z_{0,n,i}^m + \frac{1}{m}\theta_{0,n,i}(Z_{0,n,i}^m), \quad Z_{0,n,0}^m = 0, \]
for $n = 1, \ldots, N$, and let
\[ \sigma_{0,n}^m = \min\{i \ge 0 : Z_{0,n,i}^m \in C_1^m \cup (-\infty, -1/m]\} \]
be the stopping time of the $n$th particle hitting the next stage $C_1^m$, or being absorbed by $(-\infty, -1/m]$.

Stages $j = 1, \ldots, m-1$. Let $Y_{j,n}^m = Z_{j-1,n,\sigma_{j-1,n}^m}^m$, i.e., the location of the $n$th particle when it first enters $C_j^m$ or $(-\infty, -1/m]$. Let $S_j^m = \{Y_{j,n}^m : Y_{j,n}^m \in C_j^m\}$ and let $N_j^m \le N$ be the number of elements of $S_j^m$. Set $X_{j,n}^m = Y_{j,n}^m$ if $Y_{j,n}^m \in C_j^m$, and choose $X_{j,n}^m$ for indices $n$ not corresponding to an element of $S_j^m$ uniformly from $S_j^m$. Define processes $Z_{j,n,i}^m$ via $Z_{j,n,0}^m = X_{j,n}^m$ and
\[ Z_{j,n,i+1}^m = Z_{j,n,i}^m + \frac{1}{m}\theta_{j,n,i}(Z_{j,n,i}^m) \]
for $n = 1, \ldots, N$, and again let
\[ \sigma_{j,n}^m = \min\{i \ge 0 : Z_{j,n,i}^m \in C_{j+1}^m \cup (-\infty, -1/m]\}. \]

In general, the particles are thought of as evolving according to a two-step, genetic type mechanism, though we eschew the associated terminology here:
\[ (X_{j,1}^m, \ldots, X_{j,N}^m) \xrightarrow{\text{transportation}} (Y_{j+1,1}^m, \ldots, Y_{j+1,N}^m) \xrightarrow{\text{resampling}} (X_{j+1,1}^m, \ldots, X_{j+1,N}^m). \]
Since each particle is distributed according to the first entrance distribution into threshold $j$, $N_j^m/N$ is an unbiased estimate of
\[ P\{Z^m \text{ reaches threshold } j \,|\, Z^m \text{ reached threshold } j-1\}. \]
Thus,
\[ \hat{p}^m = \prod_{j=1}^m \frac{N_j^m}{N} \tag{1.6} \]
is in fact an unbiased estimate of
\[ p^m = \prod_{j=1}^m P\{Z^m \text{ reaches threshold } j \,|\, Z^m \text{ reached threshold } j-1\}. \]
The standard measure of performance in these circumstances is the variance of $\hat{p}^m$, which is to be minimized.
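The staged construction can be sketched in a few lines. The following minimal implementation is ours (names `ips_estimate` and `transport` are hypothetical) and again assumes the state-independent Gaussian increment law used above; it returns one sample of $\hat{p}^m$.

```python
import random

def ips_estimate(m, N, mean=-0.1, std=1.0, seed=1):
    """One sample of p_hat^m = prod_j N_j^m / N with thresholds C_j^m = [j/m, inf).

    Hypothetical model data: theta ~ N(mean, std^2), independent of the state."""
    rng = random.Random(seed)

    def transport(z, j):
        # run a particle until it enters C_{j+1}^m or is absorbed in (-inf, -1/m];
        # a particle already past the next threshold is left where it is
        while -1.0 / m < z < (j + 1) / m:
            z += rng.gauss(mean, std) / m
        return z

    estimate = 1.0
    particles = [0.0] * N            # stage 0 starts all particles at the origin
    for j in range(m):
        arrived = [transport(z, j) for z in particles]
        survivors = [y for y in arrived if y >= (j + 1) / m]   # the set S_{j+1}^m
        estimate *= len(survivors) / N                          # factor N_{j+1}^m / N
        if not survivors:
            return 0.0               # extinction: the whole product is zero
        # resampling: keep the survivors, redraw the rest uniformly from them
        particles = survivors + [rng.choice(survivors)
                                 for _ in range(N - len(survivors))]
    return estimate
```

Averaging over independent seeds gives a Monte Carlo approximation of $E\hat{p}^m = p^m$.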
Since $E\hat{p}^m = p^m$, minimization of the variance is equivalent to minimization of the second moment $E(\hat{p}^m)^2$. Recalling the discussion in Chapter 0, we say the estimator $\hat{p}^m$ is asymptotically optimal if the upper bound is achieved, i.e., if
\[ \lim_{m \to \infty} -\frac{1}{m} \log E(\hat{p}^m)^2 = 2\rho. \]
Of course, besides performance one must consider the computational cost to produce each sample $\hat{p}^m$. However, (0.1) provides a bound on the best possible performance.

1.3 Reformulation as an Occupation Measure Problem

The IPS approach to rare event estimation requires a very different set of techniques for its analysis than approaches based on branching processes that use a pre-determined branching rate. In particular, the statistical coupling of particles that occurs at each resampling stage prevents direct use of the "small noise" large deviation properties of the underlying distribution of $Z^m$. Instead, large deviation asymptotics for the occupation measures associated with the particles that make it to the next threshold are relevant. Indeed, if
\[ L^m = \frac{1}{m} \sum_{j=1}^m \delta_{N_j^m}, \tag{1.7} \]
then the performance measure
\[ -\frac{1}{m} \log E(\hat{p}^m)^2 = -\frac{1}{m} \log E \prod_{j=1}^m \left( \frac{N_j^m}{N} \right)^2 \]
can be rewritten in the form
\[ \hat{W}_N^m = -\frac{1}{m} \log E\left[ e^{2 \sum_{j=1}^m \log(N_j^m/N)} \right] = -\frac{1}{m} \log E\left[ e^{2m \int_{\mathbb{R}} \log(n/N)\, L^m(dn)} \right]. \tag{1.8} \]
This exhibits the performance measure as a Laplace functional of the occupation measure $L^m$. However, for many reasons the analysis of $L^m$ is nonstandard, including the following.

1. The "transition kernel" for $N_j^m$ is only implicitly defined in terms of the basic problem data [i.e., the process dynamics $\mu(dy|x)$], and involves a two-stage (transportation and resampling) construction.

2. The transition kernel will depend on $x$ through $\mu(dy|x)$. However, $x$ here is identified with the stage of the construction, and so plays the role of a "time" variable and not a state variable.

In the following chapters we carry out a full large deviations analysis of $\hat{W}_N^m$ in the limit $m \to \infty$.
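The algebra behind the rewriting in (1.8) is elementary and can be checked directly: for any realized counts $N_1^m, \ldots, N_m^m$, the product form of $(\hat{p}^m)^2$ and the Laplace-functional form written against $L^m$ agree exactly. A small sketch (helper names ours):

```python
import math

def w_from_product(counts, N):
    # -(1/m) log prod_j (N_j/N)^2, straight from the definition of p_hat^m
    m = len(counts)
    return -(1.0 / m) * math.log(math.prod((n / N) ** 2 for n in counts))

def w_from_occupation(counts, N):
    # the same quantity as a Laplace functional of L^m = (1/m) sum_j delta_{N_j}:
    # -(1/m) log exp(2m * integral of log(n/N) against L^m(dn))
    m = len(counts)
    integral = sum(math.log(n / N) for n in counts) / m
    return -(1.0 / m) * math.log(math.exp(2 * m * integral))
```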
However, the resulting rate will depend in a nontrivial way on the fixed number of particles $N$. As a consequence, the dependence of the rate constant on the number of particles should also be analyzed.

1.4 First Entrance Distributions and the Feller Property

As discussed in the last section, the large deviation properties of the IPS estimator are intimately connected with the large deviation properties of a related empirical measure on the number of particles making it to the next threshold (success particles). The number of success particles is not by itself Markovian, and a transition kernel on first entrance distributions plays an essential role in its analysis. In this section the transition kernel that arises in the limit will be defined, and properties that are needed for a large deviation analysis will be discussed.

Anticipating a change in notation later on, we replace $x$ by $t$, and recall that for $t \in [0, 1]$, $\theta_i(t)$ denotes a sequence of iid random variables with distribution $\mu(dy|t)$. Define $Z_i(t)$ by $Z_{i+1}(t) = Z_i(t) + \theta_i(t)$ and set $\sigma(t) = \min\{i \ge 0 : Z_i(t) \ge 1\}$. Note that $t$ here plays the role of a parameter. We next define a family of stochastic kernels indexed by $t$. The kernels will be measures on $[0, \infty) \cup \{\Delta\}$, where $\Delta$ corresponds to the event that $Z_i(t)$ drifts to $-\infty$ before exceeding 1, and thereby enters a "cemetery" state. For $v \in \mathbb{R}$ and $A \subset [0, \infty) \cup \{\Delta\}$, set
\[ \gamma(A|v, t) = \begin{cases} P\{Z_{\sigma(t)}(t) - 1 \in A,\ \sigma(t) < \infty \,|\, Z_0(t) = v\} & v \in (-\infty, 1),\ A \subset [0, \infty), \\ P\{\sigma(t) = \infty \,|\, Z_0(t) = v\} & v \in (-\infty, 1),\ A = \{\Delta\}, \\ \delta_{v-1}(A) & v \in [1, \infty),\ A \subset [0, \infty), \end{cases} \tag{1.9} \]
where $\delta_a$ denotes the Dirac measure at $a$. For each fixed $t$, $v \to \gamma(\cdot|v, t)$ is a transition kernel on $[0, \infty) \cup \{\Delta\}$ that reflects the distributional properties of the "overshoot," i.e., the location of the process when it first enters the next threshold given that the current overshoot is $v \ne \Delta$, when the "local" characteristics of the increments in the underlying process are determined by $\mu(dy|t)$.
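The kernel $\gamma(\cdot|v, t)$ can be sampled by simulation. The sketch below is ours: it fixes a hypothetical local law $\mu(\cdot|t) = N(-0.1, 1)$, uses a large step cap as a finite proxy for the event $\{\sigma(t) = \infty\}$ (returning the string "Delta" for the cemetery state), and reproduces the Dirac case $\delta_{v-1}$ when $v \ge 1$.

```python
import random

def gamma_sample(v, mean=-0.1, std=1.0, seed=None, max_steps=10000):
    """One draw from the first-entrance kernel gamma(.|v, t).

    Hypothetical local law mu(.|t) = N(mean, std^2); max_steps is a finite proxy
    for {sigma(t) = infinity}, in which case the cemetery state is returned."""
    rng = random.Random(seed)
    z = v
    for _ in range(max_steps):
        if z >= 1.0:
            return z - 1.0     # the overshoot Z_sigma - 1 (Dirac case if v >= 1)
        z += rng.gauss(mean, std)
    return "Delta"
```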
It is the local transition kernel that will be relevant in the IPS scheme in the limit $m \to \infty$, in a sense that will be made precise in Lemma 3.2.

Before proceeding we review some issues concerning large deviations for the occupation measures of a Markov chain. The first papers on this topic were [7, 8], which were subsequently generalized in many directions. One of the key assumptions in [7] and most other work on the problem is that the transition kernel satisfy the Feller property as in Definition 0.10. In the present context this would be the requirement that for each $t \in [0, 1]$, the map $v \to \gamma(\cdot|v, t)$ is continuous in the topology of weak convergence for $v \in [0, \infty) \cup \{\Delta\}$.

While such an assumption is reasonable in many circumstances, it is restrictive in the current setting, since $\gamma$ is defined by a first entrance distribution. The Feller property can be weakened, as in [9, Section 9.2], which requires that the values of $v$ where it fails are in some sense "negligible," so that one is still in an essentially Feller type framework. In fact, one obtains a rate function that is of exactly the same form as in the Feller case, and we will follow a similar approach here. However, for other, less structured weakenings of the Feller property, the resulting processes can display unusual large deviation properties with very non-standard forms for the rate function (when such processes can be analyzed at all). Indeed, many strange behaviors can occur when the dynamics of the chain allow the process to spend significant time near the discontinuities [13].

A second property that will be needed to establish large deviation bounds is regularity of $\gamma(\cdot|v, t)$ in $t$. Using the Markov property and a standard conditioning argument, one can uniquely characterize $\gamma(\cdot|v, t)$ for $v \in (-\infty, 1)$ and Borel sets $A \subset [0, \infty)$ via the equation [see the derivation of (3.11)]
\[ \gamma(A|v, t) = \int_{\{y : v+y < 1\}} \gamma(A|v+y, t)\,\mu(dy|t) + \mu(A + 1 - v|t). \tag{1.10} \]
It is apparent from this equation that the atoms of $\mu(\cdot|t)$ play an essential role in determining whether a Feller property or a "near" Feller property holds, and that regularity in $t$ will impose further conditions. It should also be noted that even if $\mu(\cdot|t)$ has a density that is as smooth as desired with respect to Lebesgue measure, $\gamma(\cdot|v, t)$ will not satisfy the ordinary Feller property because of the discontinuity at $v = 1$. We will use the following assumption, which is general enough to cover many standard models and simple to verify. Note that this condition implies Condition 1.2.

Condition 1.4. Consider reference probability measures $\nu_d$ and $\nu_c$, each with a finite moment generating function, such that $\nu_c$ is absolutely continuous with respect to Lebesgue measure and there is an integer $q$ such that the support of $\nu_d$ is contained in $\mathbb{Z}/q$. We then assume
\[ \mu(dx|t) = f_d(x, t)\,\nu_d(dx) + f_c(x, t)\,\nu_c(dx), \]
where $f_d(x, t)$ and $f_c(x, t)$ are continuous and uniformly positive and bounded in $x$ and $t$.

Under this condition either the Feller property will hold, or a near Feller property will hold so that the large deviation properties are as if the Feller property were true. The next lemma identifies the places where the Feller property fails, and also shows that continuity from the right always holds.

Lemma 1.5. Assume Conditions 1.1 and 1.4. Then for each $t \in [0, 1]$
\[ \{v \in [0, 1] : v \to \gamma(\cdot|v, t) \text{ is discontinuous}\} \subset (0, 1] \cap [\mathbb{Z}/q], \]
and $\gamma(\cdot|v, t)$ is continuous from the right for all $v$.

Proof. Let $t \in [0, 1]$ be given, and let $\theta_i = \theta_i(t)$ be iid with distribution $\mu(\cdot|t)$ (for simplicity we omit the $t$-dependence of $\theta_i$). For $v \in [0, 1)$, define
\[ R_v \doteq \sum_{i=1}^\infty \left( v + \sum_{k=1}^i \theta_k - 1 \right) 1_{\{v + \sum_{k=1}^\ell \theta_k < 1 \text{ for } \ell = 1, \ldots, i-1\}}\, 1_{\{v + \sum_{k=1}^i \theta_k \ge 1\}}. \]
Then, with the understanding that $v + \sum_{k=1}^\ell \theta_k < 1$ for all $\ell < \infty$ means $R_v = \Delta$, the distribution of $R_v$ is $\gamma(\cdot|v, t)$. Thus for a bounded and continuous function $f : \mathbb{R} \to \mathbb{R}$
\[ \int_{\mathbb{R}} f(x)\,\gamma(dx|v, t) = E\left[ \sum_{i=1}^\infty f\left( v + \sum_{k=1}^i \theta_k - 1 \right) 1_{\{v + \sum_{k=1}^\ell \theta_k < 1 \text{ for } \ell = 1, \ldots, i-1\}}\, 1_{\{v + \sum_{k=1}^i \theta_k \ge 1\}} \right]. \]
Suppose that $v$ is not an element of $(0, 1] \cap [\mathbb{Z}/q]$. Then for each $i$ the indicator functions in the last display are continuous at $v$ w.p.1, and so by the Lebesgue Dominated Convergence Theorem, if $v_j \to v$ as $j \to \infty$ then
\[ \int_{\mathbb{R}} f(x)\,\gamma(dx|v_j, t) \to \int_{\mathbb{R}} f(x)\,\gamma(dx|v, t). \]
The analogous argument applies when $v \in (0, 1] \cap [\mathbb{Z}/q]$ if we restrict to $v_j \downarrow v$, which proves the right continuity.

We will also need the following result, which indicates how the places where the Feller property fails can be avoided. Let $\mu_n \otimes \gamma(A \times B|t) = \int_A \gamma(B|v, t)\,\mu_n(dv)$, and similarly for $\mu \otimes \gamma$.

Lemma 1.6. Fix $t \in [0, 1]$, and assume that
\[ \{v \in [0, 1] : v \to \gamma(\cdot|v, t) \text{ is discontinuous}\} \subset (0, 1] \cap [\mathbb{Z}/q], \]
and that $\gamma(\cdot|v, t)$ is continuous from the right for all $v$. Let $\mu_n \Rightarrow \mu$, where $\mu_n \in \mathcal{P}([0, 1])$. Assume that for all $\varepsilon > 0$ there is $\delta > 0$ such that $\mu_n\left( \cup_{k=1}^q (k/q - \delta, k/q) \right) \le \varepsilon$. Then $\mu_n \otimes \gamma(dv \times dy|t) \Rightarrow \mu \otimes \gamma(dv \times dy|t)$.

Proof. To simplify notation we omit the $t$ dependence. We need to show for bounded and continuous $g$ and $h$ that
\[ \int g(v)h(y)\,(\mu_n \otimes \gamma)(dv \times dy) \to \int g(v)h(y)\,(\mu \otimes \gamma)(dv \times dy). \]
Now $f(v) = \int_{\mathbb{R}} h(y)\,\gamma(dy|v)$ is continuous from the right for all $v$, and from the left save (possibly) those $v$ of the form $k/q$. Let $f^\delta(v)$ be continuous, have the same uniform bound as $f$, and equal $f$ outside $\cup_{k=1}^q (k/q - \delta, k/q)$. Then
\[ \left| \int g(v)\,\mu_n(dv) \int h(y)\,\gamma(dy|v) - \int g(v)\,\mu_n(dv)\,f^\delta(v) \right| \le 2\varepsilon \|h\|_\infty \|g\|_\infty. \]
Since $\mu_n \Rightarrow \mu$ implies $\mu\left( \cup_{k=1}^q (k/q - \delta, k/q) \right) \le \varepsilon$,
\[ \left| \int g(v)\,\mu(dv) \int h(y)\,\gamma(dy|v) - \int g(v)\,\mu(dv)\,f^\delta(v) \right| \le 2\varepsilon \|h\|_\infty \|g\|_\infty, \]
and since $f^\delta$ is continuous, $\mu_n \Rightarrow \mu$ implies
\[ \limsup_{n \to \infty} \left| \int g(v)h(y)\,\mu_n(dv)\,\gamma(dy|v) - \int g(v)h(y)\,\mu(dv)\,\gamma(dy|v) \right| \le 4\varepsilon \|h\|_\infty \|g\|_\infty. \]
The result now follows since $\varepsilon > 0$ is arbitrary.

Chapter 2

Statement of the Main Results

In this chapter we present the main theoretical result, which is the limit of the second moment for the IPS estimator as $m \to \infty$. For n = 0, 1, . . .
, N, let $F^N(n) = -2\log(n/N)$ (with the understanding that $F^N(0) = \infty$), and for $\ell \in \mathcal{P}(\{0, \ldots, N\})$ let
\[ \left\langle F^N, \ell \right\rangle = \sum_{n=0}^N F^N(n)\,\ell(n). \tag{2.1} \]
Recall that $\hat{W}_N^m$ is defined in (1.8) to be $-(1/m)\log E(\hat{p}^m)^2$. It will be evident from the theorem statement that the decay rate for the second moment depends on the number of particles $N$. The dependence of the rate on $N$ is discussed in Chapter 6.

As discussed in Section 1.3, study of the IPS requires the large deviation analysis of an occupation measure problem with a "time" dependent transition kernel. To distinguish a point in the underlying state space of the process from variables that identify stages and hence this time variable, we use $t$ rather than $x$ to correspond to the normalized limits of the stage under the scaling $j/m$. (Its use in this manner motivated the switch from $\mu(dy|x)$ to $\mu(dy|t)$ previously.)

The local rate function $G^N(t, \phi)$ introduced below will be explicitly defined in Section 4.3 [equation (4.22)], after we have discussed the weak limits of certain controlled IPS models that appear in a representation for $\hat{W}_N^m$. The second argument $\phi$ will be a probability measure on $\{1, \ldots, N\}$. The local rate will depend on $t$ since the number of particles which reach the next level and their locations depend on the stage through $\mu(dy|t)$. Thus the large deviation problem involves a probability measure-valued process with non-stationary increments, whose increments are essentially occupation measures that reflect the local properties (i.e., $\mu(dy|x)$ for $x$ near $t$). The large deviation lower bound requires a communication condition and a condition of locally uniform ergodicity on $\gamma(\cdot|v, t)$ when thought of as a transition kernel for each fixed $t$. These conditions are stated in Chapter 4.

Remark 2.1. This thesis is concerned with the efficiency of the IPS system, and for this problem the function of interest is $\langle F^N, L^m \rangle$.
However, the proof of Theorem 2.2 goes through without change if $\langle F^N, L^m \rangle$ is replaced by any bounded and continuous function of $L^m$. One thereby obtains the full large deviations principle for $\{L^m\}$ with the rate function
\[ I_N(\nu) = \inf\left\{ \int_0^1 G^N(t, \phi(t))\,dt : \nu = \int_0^1 \phi(s)\,ds \right\}. \]

Theorem 2.2. Assume Conditions 1.1, 1.2, and 1.4 and define $G^N : [0, 1] \times \mathcal{P}(\{0, \ldots, N\}) \to [0, \infty)$ by (4.22). Then
\[ \liminf_{m \to \infty} \hat{W}_N^m \ge \inf_{\nu \in \mathcal{P}(\{1, \ldots, N\})} \left[ \inf_{\phi :\, \nu = \int_{[0,1]} \phi(t)\,dt} \int_{[0,1]} G^N(t, \phi(t))\,dt + \left\langle F^N, \nu \right\rangle \right]. \tag{2.2} \]
If in addition Conditions 5.3 and 5.1 hold, then
\[ \limsup_{m \to \infty} \hat{W}_N^m \le \inf_{\nu \in \mathcal{P}(\{1, \ldots, N\})} \left[ \inf_{\phi :\, \nu = \int_{[0,1]} \phi(t)\,dt} \int_{[0,1]} G^N(t, \phi(t))\,dt + \left\langle F^N, \nu \right\rangle \right]. \tag{2.3} \]

Part II

Large Deviation Analysis

Chapter 3

Relative Entropy Representation

The key to the proof of Theorem 2.2 is a convenient relative entropy/control representation. The first step is to introduce notation which focuses attention on the key events of interest, i.e., the particles that make it to the next level and their distribution (via the "transportation" measure), and the resampling from among these particles. The next step is to identify a "local" ergodicity which is central to the form of the local rate function $G^N(t, \phi)$. This will require several steps of rewriting the representation to account for changes of variable, which are needed to rescale and properly center the "transportation" measure. Once this is done and a suitable representation is obtained, we can analyze the asymptotics of $\hat{W}_N^m$ using this representation and weak convergence methods. The derivation of the representation is the purpose of the present chapter.

3.1 The Representation Formula

We first define the single particle (underlying) transportation kernels at the prelimit level by
\[ \alpha_j^m(dy|x) = P\left\{ Z_{\sigma_j^m}^m \in dy \,\middle|\, Z_0^m = x \right\}, \tag{3.1} \]
where $x \in C_j^m \setminus C_{j+1}^m$, $\sigma_j^m = \inf\{i \ge 0 : Z_i^m \in C_{j+1}^m \cup (-\infty, -1/m]\}$, and where $\{Z_i^m\}$ is the underlying process model defined in (1.2).
Note that $\alpha_j^m$ is a stochastic kernel on $(-\infty, -1/m] \cup C_{j+1}^m$ given $C_j^m \setminus C_{j+1}^m$.

Next we define a rescaled version of the original process according to
\[ \breve{Z}_{i+1}^m = \breve{Z}_i^m + \theta_i(\breve{Z}_i^m/m). \tag{3.2} \]
It is easy to check that $\breve{Z}_i^m = mZ_i^m$ if $\breve{Z}_0^m = mZ_0^m$. The thresholds for the rescaled process are stretched versions of those used for the original process: $\breve{C}_j = mC_j^m = [j, \infty)$. With notation that follows that of the original IPS model, we denote the locations of particles before and after resampling at stage $j$ by $(\breve{Y}_{j,1}^m, \ldots, \breve{Y}_{j,N}^m)$ and $(\breve{X}_{j,1}^m, \ldots, \breve{X}_{j,N}^m)$. Then the related transportation kernel is defined by
\[ \breve{\alpha}_j^m(dy|x) = P\left\{ \breve{Z}_{\breve{\sigma}_j^m}^m \in dy \,\middle|\, \breve{Z}_0^m = x \right\} = P\left\{ mZ_{\sigma_j^m}^m \in dy \,\middle|\, mZ_0^m = x \right\} = \alpha_j^m(dy/m\,|\,x/m), \]
where $x \in \breve{C}_j \setminus \breve{C}_{j+1}$ and $\breve{\sigma}_j^m = \inf\{i \ge 0 : \breve{Z}_i^m \in \breve{C}_{j+1} \cup (-\infty, -1]\}$. In comparison with the original transportation kernel defined in (3.1), $\breve{\alpha}_j^m$ is now a stochastic kernel on $(-\infty, -1] \cup \breve{C}_{j+1}$ given $\breve{C}_j \setminus \breve{C}_{j+1}$. Note that both sets no longer depend on $m$. Later in this section a centering convention will be introduced, so that all the transportation kernels are on the same fixed space. For the rest of this chapter we will focus on the rescaled processes and related measures. This is without loss, in that all quantities of interest are the same for both systems. With an abuse of notation, we omit the "˘" from the notation.

To obtain a control representation, we exploit an underlying Markovian structure. Slightly complicating the description and notation is the fact that the evolution interweaves the transportation and resampling steps, and it will be useful in the representation to keep them distinct. At any given stage $j$, $(X_{j,1}^m, \ldots, X_{j,N}^m)$ takes values in $C_j^N \subset \mathbb{R}^N$. Given $(X_{j,1}^m, \ldots, X_{j,N}^m) = (x_1, \ldots, x_N)$, the transition kernel for the transportation stage is
\[ \Lambda_j^m(dy_1, \ldots, dy_N \,|\, x_1, \ldots, x_N) = \prod_{n=1}^N \left[ 1_{C_j \setminus C_{j+1}}(x_n)\,\alpha_j^m(dy_n|x_n) + 1_{C_{j+1}}(x_n)\,\delta_{x_n}(dy_n) \right]. \tag{3.3} \]
One can construct a measurable mapping $g_{j+1}^m = (g_{\mathrm{num},j+1}^m, g_{\mathrm{loc},j+1}^m)$ such that
\[ g_{\mathrm{num},j+1}^m(Y_{j+1,1}^m, \ldots, Y_{j+1,N}^m) = N_{j+1}^m, \]
\[ g_{\mathrm{loc},j+1}^m(Y_{j+1,1}^m, \ldots, Y_{j+1,N}^m) = (\tilde{Y}_{j+1,1}^m, \ldots, \tilde{Y}_{j+1,N_{j+1}^m}^m, 0, \ldots, 0), \]
where $\{\tilde{Y}_{j+1,i}^m\}$ are the locations of particles that made it up to the next level, and the trailing zeros are placeholders which keep the dimension of the vector equal to $N$. In fact, there are many such mappings, in that they may produce different orders for the $\tilde{Y}_{j+1,k}^m$, and the particular one used is not important. Now suppose that given $N_{j+1}^m = n$ and $\tilde{Y}_{j+1,k}^m = \tilde{y}_k$, $k \le n$, a random vector $(\sigma_1, \ldots, \sigma_N)$ is generated according to the product measure
\[ \Theta(\cdot|n) = \otimes_{i=1}^n \delta_i \otimes_{i=n+1}^N U_n, \tag{3.4} \]
where $U_n$ is the uniform measure on $\{1, \ldots, n\}$. The role of the $\otimes_{i=1}^n \delta_i$ is also as a placeholder, and simply says that the first $n$ particles have already been determined, and thus resampling is only used to determine the location of the remaining $N - n$ particles. Then we can set
\[ X_{j+1,k}^m = \begin{cases} \tilde{y}_k & k = 1, \ldots, n, \\ \tilde{y}_{\sigma_k} & k = n+1, \ldots, N. \end{cases} \]
There is a measurable map $h_j^m$ that takes $(g_{j+1}^m(Y_{j+1,1}^m, \ldots, Y_{j+1,N}^m), \sigma_1, \ldots, \sigma_N)$ into $(X_{j+1,1}^m, \ldots, X_{j+1,N}^m)$.

With these definitions, the IPS process can be built on the following probability space. Let $\mathcal{X} = \{(-\infty, -1] \cup [0, \infty)\}^N$ and $\mathcal{Y} = \{1, \ldots, N\}^N$, and set $\mathcal{Z} = [\mathcal{X} \times \mathcal{Y}]^m$. To simplify the notation, we denote corresponding vectors by bold symbols, e.g., $\mathbf{x}_0 = (x_{0,1}, \ldots, x_{0,N})$. We define a probability measure on $\mathcal{Z}$ by
\[ P^m(d\mathbf{y}_1, d\boldsymbol{\sigma}_1, \ldots, d\mathbf{y}_m, d\boldsymbol{\sigma}_m) = \Lambda_0^m(d\mathbf{y}_1|\mathbf{x}_0)\,\Theta(d\boldsymbol{\sigma}_1|g_{\mathrm{num},1}^m(\mathbf{y}_1))\,\Lambda_1^m(d\mathbf{y}_2|h_1^m(g_1^m(\mathbf{y}_1), \boldsymbol{\sigma}_1))\,\Theta(d\boldsymbol{\sigma}_2|g_{\mathrm{num},2}^m(\mathbf{y}_2)) \cdots. \]
Note that the distribution induced by $h_1^m(g_1^m(\mathbf{y}_1), \boldsymbol{\sigma}_1)$ when $\mathbf{y}_1$ and $\boldsymbol{\sigma}_1$ have distribution $\Lambda_0^m(d\mathbf{y}_1|\mathbf{x}_0)\,\Theta(d\boldsymbol{\sigma}_1|g_{\mathrm{num},1}^m(\mathbf{y}_1))$ is the same as that of $(X_{1,1}^m, \ldots, X_{1,N}^m)$ given $(X_{0,1}^m, \ldots, X_{0,N}^m) = (x_{0,1}, \ldots, x_{0,N})$.
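The resampling map $h$ with $\sigma \sim \Theta(\cdot|n)$ admits a direct implementation: the $n$ success particles are kept, and each failed particle is assigned the position of a uniformly chosen success particle. A sketch (the function name is ours, not from the thesis):

```python
import random

def resample(arrived, threshold, rng):
    """Resampling step of the IPS: keep the particles that reached [threshold, inf),
    and give each failed particle the position of a uniformly chosen success
    particle (the map h with sigma ~ Theta(.|n))."""
    survivors = [y for y in arrived if y >= threshold]   # the n success particles
    if not survivors:
        raise ValueError("extinction: no particle reached the next threshold")
    return survivors + [rng.choice(survivors)
                        for _ in range(len(arrived) - len(survivors))]
```

Note that, as in (3.4), the first $n$ coordinates of the output are deterministic given the arrivals; randomness enters only through the $N - n$ redrawn positions.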
We can then expand the expectation in (1.8) as
\[ \hat{W}_N^m = -\frac{1}{m} \log \int_{\mathcal{Z}} e^{2 \sum_{j=1}^m \log\left( g_{\mathrm{num},j}^m(\mathbf{y}_j)/N \right)}\, P^m(d\mathbf{y}_1, d\boldsymbol{\sigma}_1, \ldots, d\mathbf{y}_m, d\boldsymbol{\sigma}_m). \]
Applying Proposition 0.5, the relative entropy representation for exponential integrals, we have
\[ -\frac{1}{m} \log \int_{\mathcal{Z}} e^{2 \sum_{j=1}^m \log\left( g_{\mathrm{num},j}^m(\mathbf{y}_j)/N \right)}\, P^m(d\mathbf{y}_1, d\boldsymbol{\sigma}_1, \ldots, d\mathbf{y}_m, d\boldsymbol{\sigma}_m) = \inf_{Q^m} \left[ \frac{1}{m} R(Q^m \| P^m) - \frac{2}{m} \int_{\mathcal{Z}} \sum_{j=1}^m \log\left( \frac{g_{\mathrm{num},j}^m(\mathbf{y}_j)}{N} \right) Q^m(d\mathbf{y}_1, d\boldsymbol{\sigma}_1, \ldots, d\mathbf{y}_m, d\boldsymbol{\sigma}_m) \right]. \]
We now factor $Q^m(d\mathbf{y}_1, d\boldsymbol{\sigma}_1, \ldots, d\mathbf{y}_m, d\boldsymbol{\sigma}_m)$ to reflect the way that $P^m$ is factored. While $P^m$ exhibits a type of Markovian structure, the same cannot be assumed of $Q^m$. However, we can always factor it so that the resulting conditional distributions do not anticipate the future. Thus we write
\[ Q^m(d\mathbf{y}_1, d\boldsymbol{\sigma}_1, \ldots, d\mathbf{y}_m, d\boldsymbol{\sigma}_m) = \bar{\Lambda}_0^m(d\mathbf{y}_1)\,\bar{\Theta}_1^m(d\boldsymbol{\sigma}_1)\,\bar{\Lambda}_1^m(d\mathbf{y}_2) \cdots \bar{\Lambda}_{m-1}^m(d\mathbf{y}_m)\,\bar{\Theta}_m^m(d\boldsymbol{\sigma}_m), \]
where $\bar{\Theta}_j^m(d\boldsymbol{\sigma}_j) \in \mathcal{P}(\mathcal{Y})$ is implicitly a function of $(\mathbf{y}_1, \boldsymbol{\sigma}_1, \ldots, \mathbf{y}_j)$ and $\bar{\Lambda}_{j-1}^m(d\mathbf{y}_j) \in \mathcal{P}((-\infty, -1] \cup C_j)$ is implicitly a function of $(\mathbf{y}_1, \boldsymbol{\sigma}_1, \ldots, \mathbf{y}_{j-1}, \boldsymbol{\sigma}_{j-1})$. Denote the distribution of $(\bar{\mathbf{Y}}_1^m, \bar{\boldsymbol{\sigma}}_1^m, \ldots, \bar{\mathbf{Y}}_m^m, \bar{\boldsymbol{\sigma}}_m^m)$ by $Q^m$. Then using the chain rule for relative entropy, Theorem 0.8, to decompose $R(Q^m \| P^m)$, we can write the variational representation as
\[ \inf_{\{(\bar{\Lambda}_j^m, \bar{\Theta}_{j+1}^m)\}} \bar{E}\left[ \frac{1}{m} \sum_{j=0}^{m-1} R\left( \bar{\Lambda}_j^m(\cdot) \,\middle\|\, \Lambda_j^m(\cdot\,|\,h_j^m(g_j^m(\bar{\mathbf{Y}}_j^m), \bar{\boldsymbol{\sigma}}_j^m)) \right) + \frac{1}{m} \sum_{j=1}^m R\left( \bar{\Theta}_j^m(\cdot) \,\middle\|\, \Theta(\cdot\,|\,g_{\mathrm{num},j}^m(\bar{\mathbf{Y}}_j^m)) \right) - \frac{2}{m} \sum_{j=1}^m \log\left( \frac{g_{\mathrm{num},j}^m(\bar{\mathbf{Y}}_j^m)}{N} \right) \right], \tag{3.5} \]
where $\bar{E}$ denotes expected value with respect to $Q^m$.

To proceed with the analysis, we need to introduce a centering in the previous definitions. The usefulness of the centering is due to the fact that the environment presented to each of the transportation and resampling measures is in some sense stationary relative to the starting location $C_j$ of the particles at stage $j$. Our goal is to analyze what amounts to an empirical measure on the number and locations of particles that make it to successive levels.
For this to work out, the various measures should be defined on a fixed space that does not depend on $j$. In fact, it is only the difference of particle positions before and after transportation that we need to record. By a change of variable we can redefine the transportation kernels $\alpha_j^m$ so that rather than being on $(-\infty, -1] \cup C_{j+1}$ given a point in $C_j \setminus C_{j+1}$, we consider them as measures on the fixed space $\Delta \cup [0, \infty)$ given $[0, 1)$, where $\Delta$ will be called the "cemetery" state and plays the role of $(-\infty, -1]$, while $[0, \infty)$ plays the role of $C_{j+1}$. We make the analogous change with respect to the $N$-particle transportation kernels $\Lambda_j^m$ defined in (3.3). Under this convention, the mappings $g_j^m = (g_{\mathrm{num},j}^m, g_{\mathrm{loc},j}^m)$ and $h_j^m$ now have domains and ranges
\[ g : \{\Delta \cup [0, \infty)\}^N \to \{1, \ldots, N\} \times [0, \infty)^N, \]
\[ h : \{1, \ldots, N\} \times [0, \infty)^N \times \mathcal{Y} \to [0, \infty)^N. \]
Thus $g = (g_{\mathrm{num}}, g_{\mathrm{loc}})$, where $g_{\mathrm{num}}$ counts the number of particles that make it up to $[0, \infty)$ (termed success particles), and $g_{\mathrm{loc}}$ returns the locations of these particles. Again we will abuse notation and use the same labels as before for these new centered measures. In terms of the original random variables, we can exhibit the centered and rescaled transportation kernel, for $x \in [0, 1)$, as
\[ \alpha_j^m(A|x) = \begin{cases} P\left\{ mZ_{\sigma_j^m}^m - (j+1) \in A \,\middle|\, mZ_0^m - j = x \right\} & \text{if } A \subset [0, \infty), \\ P\left\{ mZ_{\sigma_j^m}^m \in (-\infty, -1] \,\middle|\, mZ_0^m - j = x \right\} & \text{if } A = \{\Delta\}. \end{cases} \tag{3.6} \]
The transportation kernel is then
\[ \Lambda_j^m(dy_1, \ldots, dy_N \,|\, x_1, \ldots, x_N) = \prod_{n=1}^N \left[ 1_{[0,1)}(x_n)\,\alpha_j^m(dy_n|x_n) + 1_{[1,\infty)}(x_n)\,\delta_{x_n-1}(dy_n) \right]. \tag{3.7} \]
Note that all the transportation kernels are measures on the same space, i.e., $\{\Delta \cup [0, \infty)\}^N$.
We thereby obtain the following variational representation, which is now amenable to weak convergence analysis:
\[ V_N^m = \inf_{\{(\bar{\Lambda}_j^m, \bar{\Theta}_{j+1}^m)\}} \bar{E}\left[ \frac{1}{m} \sum_{j=0}^{m-1} R\left( \bar{\Lambda}_j^m(\cdot) \,\middle\|\, \Lambda_j^m(\cdot\,|\,h(g(\bar{\mathbf{Y}}_j^m), \bar{\boldsymbol{\sigma}}_j^m)) \right) + \frac{1}{m} \sum_{j=1}^m R\left( \bar{\Theta}_j^m(\cdot) \,\middle\|\, \Theta(\cdot\,|\,g_{\mathrm{num}}(\bar{\mathbf{Y}}_j^m)) \right) - \frac{2}{m} \sum_{j=1}^m \log\left( \frac{g_{\mathrm{num}}(\bar{\mathbf{Y}}_j^m)}{N} \right) \right], \tag{3.8} \]
where $V_N^m$ is referred to as the minimal cost function.

Without loss we can restrict to controls for which the cost is finite. This implies the finiteness of the relative entropies, which in turn implies that the resampling controls $\bar{\Theta}_j^m$ should have the same support as $\Theta(\cdot|g_{\mathrm{num}}(\bar{\mathbf{Y}}_j^m))$, w.p.1. Recalling that $\Theta(\cdot|n) = \otimes_{i=1}^n \delta_i \otimes_{i=n+1}^N U_n$, this means that $\bar{\Theta}_j^m(\cdot|\omega)$ will inherit the property that the resampling control leaves the $\bar{N}_j^m = g_{\mathrm{num}}(\bar{\mathbf{Y}}_j^m)$ particles that made it up alone, and samples from among these particles (though perhaps with a distribution not equal to the uniform distribution on $\{1, \ldots, \bar{N}_j^m\}$) to determine the locations of the remaining $N - g_{\mathrm{num}}(\bar{\mathbf{Y}}_j^m)$ particles.

We formally state the representation so derived.

Theorem 3.1 (Representation Theorem). Let $\Theta$, $\Lambda_j^m$ and $V_N^m$ be defined by (3.4), (3.7) and (3.8), respectively. Then for all $m \in \mathbb{N}$
\[ \hat{W}_N^m = -\frac{1}{m} \log E \prod_{j=1}^m \left( \frac{N_j^m}{N} \right)^2 \]
is equal to the minimal cost function $V_N^m$.

3.2 The Limiting Behavior of the Underlying Transportation Kernels

To evaluate the asymptotics of $\hat{W}_N^m$ and identify the rate function, we will need to identify the limit of $\Lambda_j^m$. The rescaled and centered transportation kernels $\alpha_j^m(dy|x)$ are related to the asymptotic first entrance distributions defined in (1.9). The following lemma makes the relation precise.

Lemma 3.2. Assume Conditions 1.1 and 1.4. Define $\alpha_j^m(dy|x)$ by (3.6) and $\gamma(dy|x, t)$ by (1.9), and assume that $j/m \to t \in (0, 1)$ as $m \to \infty$. Then for $x \in [0, 1)$
\[ \alpha_j^m(\cdot|x) \Rightarrow \gamma(\cdot|x, t), \]
and thus for $x \in [0, \infty)$
\[ 1_{[0,1)}(x)\,\alpha_j^m(\cdot|x) + 1_{[1,\infty)}(x)\,\delta_{x-1}(\cdot) \Rightarrow \gamma(\cdot|x, t). \tag{3.9} \]
Moreover, for each $x \in [0, 1)$, $t \to \gamma(\cdot|x, t)$ is continuous for $t \in [0, 1]$.

Proof. We recall that $\alpha_j^m(\cdot|x)$ is defined in (3.6) by setting
\[ Z_{i+1} = Z_i + \theta_i(Z_i/m), \quad Z_0 = j + x, \tag{3.10} \]
and then stopping this process at the first time $\sigma_j^m$ when either $Z_i \ge j+1$ or $Z_i \le -1$, with the latter event meaning that mass is assigned to $\Delta$, and $\alpha_j^m(A|x) = P\{Z_{\sigma_j^m} \in j+1+A\}$ when $A \subset [0, \infty)$. Let $b \in (0, 1)$, and define
\[ \sigma_j^m(b) = \inf\{i \ge 0 : Z_i \ge j+1 \text{ or } Z_i \le jb\}. \]
The parameter $b$ is used to localize the problem, so that we only have to pay attention to the process when $Z_i/m$ is very close to $t$. Under the stability condition (Condition 1.1), given any $a > 0$, for all sufficiently large $m$ we have
\[ P\{Z_{\sigma_j^m} \ge j+1 \text{ and } Z_{\sigma_j^m(b)} \le jb\} \le a. \]
It follows that if we define "barred" measures by $\bar{\alpha}_j^m(A|x) = P\{Z_{\sigma_j^m(b)} \in j+1+A\}$ and $\bar{\alpha}_j^m(\Delta|x) = P\{Z_{\sigma_j^m(b)} < j+1\}$, then modulo an error in the total variation norm of size $2a$, $\alpha_j^m(\cdot|x)$ and $\bar{\alpha}_j^m(\cdot|x)$ have the same weak limit. We can also define $\bar{\gamma}(\cdot|x, t)$ in the analogous way, and obtain the same total variation norm error bound. Therefore, it suffices to show that $\bar{\alpha}_j^m(\cdot|x) \Rightarrow \bar{\gamma}(\cdot|x, t)$.

Again using Condition 1.1, $\bar{\alpha}_j^m(A|x)$ can be characterized, by conditioning on the first step, as the unique solution to the integral equation
\[ \bar{\alpha}_j^m(A|x) = \int_{\{y :\, jb < j+x+y < j+1\}} \bar{\alpha}_j^m(A|x+y)\,\mu\left(dy \,\middle|\, \frac{j+x}{m}\right) + \mu\left(A + 1 - x \,\middle|\, \frac{j+x}{m}\right). \]
Iterating this equation leads to a series representation, (3.13), whose $n$th term integrates over the set $D_n^m$ of increment sequences for which the process has not escaped by step $n$. For any $\varepsilon > 0$ there exists $M < \infty$ such that the probability that the random walk escapes from $[-1, j+1]$ in $M$ steps is greater than $\varepsilon$, uniformly in $x$, and thus
\[ P\{\sigma_j^m(b) \le M \,|\, Z_0 = x\} \ge P\{\sigma_j^m \le M \,|\, Z_0 = x\} > \varepsilon. \]
On the other hand, the first term in (3.13) can be bounded above in terms of the probability of not escaping in $n$ steps. In fact, for $n = kM + l$, $k \ge 1$, $l < M$,
\[ \int_{D_n^m} \bar{\alpha}_j^m\left(A \,\middle|\, \sum_{i=0}^n x_i\right) \mu(dx_n|t) \cdots \mu(dx_1|t) = P\{\sigma_j^m(b) > n,\ Z_n \in j+1+A \,|\, x\} \le (1-\varepsilon)^k = \left\{ [1-\varepsilon]^{1/(M+1)} \right\}^{(M+1)k} \le \rho^n, \]
where $\rho = \rho(\varepsilon, M) = [1-\varepsilon]^{1/(M+1)} < 1$. Repeating the analysis for $\bar{\gamma}$, we get
\[ \int_{D_n^m} \bar{\gamma}\left(A \,\middle|\, \sum_{i=0}^n x_i\right) \mu(dx_n|t) \cdots \mu(dx_1|t) \le \rho^n. \]
Pick $b < 1$ such that $c_b$ satisfies $\rho/c_b < 1$. Then the series in (3.13) is convergent, and for $n$ large enough
\[ \bar{\alpha}_j^m(A|x) \le \left( \frac{\rho}{c_b} \right)^n + \sum_{k=0}^{n-1} \left( \frac{1}{c_b} \right)^{k+1} \int_{D_k^m} \mu\left( A + 1 - \sum_{i=0}^k x_i \,\middle|\, t \right) \mu(dx_k|t) \cdots \mu(dx_1|t) \le \left( \frac{\rho}{c_b} \right)^n + \left( \frac{1}{c_b} \right)^n \sum_{k=0}^{n-1} \int_{D_k^m} \mu\left( A + 1 - \sum_{i=0}^k x_i \,\middle|\, t \right) \mu(dx_k|t) \cdots \mu(dx_1|t) \le \left( \frac{\rho}{c_b} \right)^n + \left( \frac{1}{c_b} \right)^n \bar{\gamma}(A|x, t). \]
With a similar computation we obtain the lower bound
\[ \bar{\alpha}_j^m(A|x) \ge c_b^n \int_{D_n^m} \bar{\alpha}_j^m\left(A \,\middle|\, \sum_{i=0}^n x_i\right) \mu(dx_n|t) \cdots \mu(dx_1|t) + \sum_{k=0}^{n-1} c_b^{k+1} \int_{D_k^m} \mu\left( A + 1 - \sum_{i=0}^k x_i \,\middle|\, t \right) \mu(dx_k|t) \cdots \mu(dx_1|t) \ge c_b^n \left( \bar{\gamma}(A|x, t) - \rho^n \right). \]
Combining these two inequalities gives
\[ c_b^n \left( \bar{\gamma}(A|x, t) - \rho^n \right) \le \bar{\alpha}_j^m(A|x) \le (1/c_b)^n \left( \bar{\gamma}(A|x, t) + \rho^n \right). \]
The claim now follows by sending $m \to \infty$ (and using $j/m \to t$), then $b \uparrow 1$, and finally $n \to \infty$. The last sentence of the lemma follows from a very similar construction, and its proof is hence omitted.

Chapter 4

The Large Deviation Upper Bound

In this chapter we prove (2.2), which corresponds to a large deviation upper bound. We first rewrite the control representation (3.8) so that the controls and controlled process are in a space that is independent of $m$, which facilitates the use of weak convergence arguments. Tightness is addressed at the beginning of Section 4.2, and then weak convergence methods are used to relate the limits of the controls to the limits of the processes. It is this asymptotic relationship which identifies the local rate function, and in Section 4.3 these results are used to prove a large deviation upper bound with the given rate.

4.1 Notation and Preliminaries

In this section we make a few final adjustments to the relative entropy representation. A number of stochastic kernels indexed by $t \in [0, 1]$ will be needed.
To simplify the notation, the definition at $t = 1$ is usually omitted. This is without loss, since these kernels are always integrated with respect to Lebesgue measure.

Recall the control representation (3.8). Let $\bar{\mathbf{X}}_j^m = h(g(\bar{\mathbf{Y}}_j^m), \bar{\boldsymbol{\sigma}}_j^m)$ and $\bar{N}_j^m = g_{\mathrm{num}}(\bar{\mathbf{Y}}_j^m)$. These quantities have exactly the same interpretation as for the original process, but with respect to the controlled distributions. Thus $\bar{N}_j^m$ is the number of particles making it to threshold $j$, and $\bar{\mathbf{X}}_j^m$ are the locations after resampling. We will also want to keep track of the dependence of the transportation control as a function of the state of the process $\bar{\mathbf{X}}_j^m$. For $m \in \mathbb{N}$ and $t \in [0, 1]$ define the stochastic kernel
\[ \bar{\xi}^m(dx, dy|t) \doteq \delta_{\bar{\mathbf{X}}_j^m}(dx) \times \bar{\Lambda}_j^m(dy) \quad \text{if } t \in \left[ \frac{j}{m}, \frac{j+1}{m} \right),\ j = 0, \ldots, m-1, \]
where $\bar{\Lambda}_j^m(dy) = \bar{\Lambda}_j^m(dy|\bar{\mathbf{Y}}_1^m, \bar{\boldsymbol{\sigma}}_1^m, \ldots, \bar{\mathbf{Y}}_j^m, \bar{\boldsymbol{\sigma}}_j^m)$. Also define
\[ \bar{\eta}^m(dn, d\sigma|t) \doteq \delta_{\bar{N}_{j+1}^m}(dn) \times \bar{\Theta}_{j+1}^m(d\sigma) \quad \text{if } t \in \left[ \frac{j}{m}, \frac{j+1}{m} \right),\ j = 0, \ldots, m-1, \]
where $\bar{\Theta}_j^m(d\sigma) = \bar{\Theta}_j^m(d\sigma|\bar{\mathbf{Y}}_1^m, \bar{\boldsymbol{\sigma}}_1^m, \ldots, \bar{\mathbf{Y}}_j^m)$. Random probability measures $\bar{\xi}^m \in \mathcal{P}([0, \infty)^N \times \{\Delta \cup [0, \infty)\}^N \times [0, 1])$ and $\bar{\eta}^m \in \mathcal{P}(\{1, \ldots, N\} \times \mathcal{Y} \times [0, 1])$ are then defined by
\[ \bar{\xi}^m(\mathbf{A}_1 \times \mathbf{A}_2 \times B) \doteq \int_B \bar{\xi}^m(\mathbf{A}_1 \times \mathbf{A}_2|t)\,dt, \tag{4.1} \]
\[ \bar{\eta}^m(C_1 \times \mathbf{C}_2 \times B) \doteq \int_B \bar{\eta}^m(C_1 \times \mathbf{C}_2|t)\,dt \tag{4.2} \]
for Borel subsets $\mathbf{A}_1$ of $[0, \infty)^N$, $\mathbf{A}_2$ of $\{\Delta \cup [0, \infty)\}^N$, $C_1$ of $\{1, \ldots, N\}$, $\mathbf{C}_2$ of $\mathcal{Y}$, and $B$ of $[0, 1]$. Again, we use bold uppercase letters to denote subsets of the higher dimensional spaces. Similarly, measures related to the distributions used to define the original process are defined by
\[ r^m(dx, dy|t) \doteq \delta_{\bar{\mathbf{X}}_j^m}(dx) \times \Lambda_j^m(dy|\bar{\mathbf{X}}_j^m) \quad \text{if } t \in \left[ \frac{j}{m}, \frac{j+1}{m} \right),\ j = 0, \ldots, m-1, \]
\[ q^m(dn, d\sigma|t) \doteq \delta_{\bar{N}_{j+1}^m}(dn) \times \Theta(d\sigma|\bar{N}_{j+1}^m) \quad \text{if } t \in \left[ \frac{j}{m}, \frac{j+1}{m} \right),\ j = 0, \ldots, m-1, \]
and
\[ r^m(\mathbf{A}_1 \times \mathbf{A}_2 \times B) \doteq \int_B r^m(\mathbf{A}_1 \times \mathbf{A}_2|t)\,dt, \tag{4.3} \]
\[ q^m(C_1 \times \mathbf{C}_2 \times B) \doteq \int_B q^m(C_1 \times \mathbf{C}_2|t)\,dt, \tag{4.4} \]
where $\mathbf{A}_i$, $C_i$, $B$ are as above.
The chain rule for relative entropy Theorem 0.8 is then used to rewrite (3.8) in terms of these measures. To be precise, we repeatedly use the fact that for probability measures and stochastic kernels defined on the appropriate spaces Z R (µ(dz|u)γ(du) kν(dz|u)γ(du)) = R (µ(·|u) kν(·|u)) γ(du) U m to insert the Dirac functions appearing in the definitions of ¯ξ , η¯m , rm and q m without any change in the cost. For ` ∈ P ({1, . . . , N }) recall the definition N X N n F ,` = −2 log ` (n) . n=0 N With the 1/m factors accounted for by the piecewise constant nature of the measures and the Lebesgue integral, (3.8) can be restated as Z  1  m   N m ˆm= W inf ¯ E ¯ m R ξ (· |t ) kr (· |t) + R (¯ m m ¯ η (· |t ) kq (· |t ))) dt + F , L , N {¯ξm ,¯ηm } 0 (4.5) where X m ¯m = 1 L δ ¯m (4.6) m j=1 Nj is the controlled empirical measure. The proof of Theorem 2.2 uses weak convergence arguments to analyze the limit of (4.5). Before proving the tightness results for our problem, it is useful to recall some criteria for the tightness of probability measures. In the lemmas to follow, V 45 is a Polish space and A is an arbitrary index set. A tightness function on V is a mapping g : V → [0, ∞] such that for each M ∈ [0, ∞), {v : g(v) ≤ M } is compact. The first result is given in [9, Theorem A.3.17]. Lemma 4.1. Let g be a tightness function on V. Define a function G (θ) mapping P (V) into R ∪ {∞} by Z G (θ) = gdθ. V Then the following conclusions hold. (a) For each M < ∞, the set ZG (M ) = {θ ∈ P (V) : G (θ) ≤ M } is tight. (b) G is a tightness function on P (V). The next result can be found in [20, Theorem 6.1]. Lemma 4.2. A family {ν a : (Ω, F, P) → P (V) , a ∈ A} of random probability measures on a Polish space V is tight if and only if the family of (deterministic) probability measures {Eν a , a ∈ A} ⊂ P (V) is tight. The next lemma gives an elementary sufficient condition for the tightness of empirical measures.  Lemma 4.3. Let Xj ∈ RN , j = 0, . . . 
m − 1 be a collection of random vectors. P Define the empirical measure J m = m1 m−1 j=0 δ Xj . Suppose that given ε > 0 there exists b < ∞ such that for all n ∈ {1, . . . , N } 1 X m−1 P (|Xj,n | > b) < ε. m j=0 Then {J m } is tight. Proof. By Lemma 4.2, it suffices to show that given δ > 0, there is a compact set K such that EJ m (Kc ) < δ for all m. Given δ, let ε = δ/N and then choose b so that 46 the previous display holds. Letting K = [−b, b]N , we have 1 X 1 XX m−1 m−1 N m c c EJ (K ) = P (Xj ∈ K ) ≤ P (|Xj,n | > b) < N ε = δ. m j=0 m j=0 n=1 4.2 Tightness and Characterization of Limits We now proceed to prove the tightness of both the underlying measure and the ad- missible control measures. Note that, owing to the rescaling and centering, measures such as αm j (dy |x ) are stochastic kernels on ∆ ∪ [0, ∞). Proposition 4.4. Assume Conditions 1.1 and 1.2. For each m ∈ N, consider any ¯m admissible control sequence {Λ ¯m j , j = 0, . . . , m − 1} and {Θj , j = 1, . . . , m} such that ( ) 1 X m−1   ¯ sup E R Λ¯m m ¯ m ¯m j (·) Λj (· h g(Yj ), σ j ) < ∞ m∈N m j=0 ( ) 1 Xm   sup E¯ R Θ ¯ m (·) Θ(· gnum Y ¯ jm ) < ∞, j m∈N m j=1 where Λ ¯ m (·|Y ¯ m (·) = Λ ¯ 1, σ ¯ j, σ ¯ 1, . . . , Y ¯ j ) and Θ ¯ m (·|Y ¯ m (·) = Θ ¯ 1, σ ¯ j ). Then ¯ 1, . . . , Y j j j j the following results hold. 1. {rm , m ∈ N} with rm defined in (4.3) is tight. m 2. Let the admissible control measures (¯ξ , η¯m ) be defined by equations (4.1) and m (4.2). Then {(¯ξ , η¯m ), m ∈ N} is tight. Proof. Part 1 (tightness of rjm ). We divide the proof for the tightness of rm into three steps. Step 1: the tightness of αm . Let H (x, α) be the moment generating function defined . in (1.4). By Condition 1.2, H (α) = supx∈[−1,1] H (x, α) < ∞. Given j = 0, . . . , m−1, 47 m m let {ˆθj,i (x) , i ∈ N, x ∈ [−1, 1]} be iid random fields with ˆθj,i (x) ∼ µ (· |x + j/m ) m and {Zj,i } be the Markov chain defined by  m Zj,i+1 m = Zj,i + ˆθj,i Zj,i m /m , m Zj,0 = z < 0. 
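The exponential supermartingale built from the moment generating function drives the tail bound in Step 1. As a sanity check, here is a Monte Carlo verification of a Chernoff-type bound for a toy stand-in (i.i.d. Gaussian steps with negative mean replace the state-dependent increments; this choice is purely for illustration). With steps $\mathcal{N}(-1/2,1)$ one has $H(\alpha) = -\alpha/2 + \alpha^2/2$, so $\alpha^* = 1$ gives $H(\alpha^*) = 0$, $e^{\alpha^* Z_i}$ is a martingale, and the maximal inequality yields $P(\sup_i Z_i \ge b) \le e^{-\alpha^* b}$ for $Z_0 = 0$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the walk Z: i.i.d. Gaussian steps with negative mean.
# H(alpha) = log E[exp(alpha * step)] = -alpha/2 + alpha^2/2, and
# alpha* = 1 solves H(alpha*) = 0, so exp(alpha* Z_i) is a martingale and
# P(sup_i Z_i >= b) <= exp(-alpha* b) when Z_0 = 0.
alpha_star, b = 1.0, 2.0
paths, steps = 20_000, 300

increments = rng.normal(-0.5, 1.0, size=(paths, steps))
sup_z = np.cumsum(increments, axis=1).max(axis=1)   # running maximum of each path
estimate = float(np.mean(sup_z >= b))
bound = float(np.exp(-alpha_star * b))
print(f"P(sup Z >= {b}) ~ {estimate:.4f}, bound e^(-alpha*.b) = {bound:.4f}")
```

The simulated exceedance frequency sits strictly below the exponential bound, which is the same mechanism that produces (4.7) for the state-dependent chain.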
m m Then for each fixed α, the sequence Tj,i = exp{αZj,i − iH (α)} is a supermartingale m m with respect to the filtration Fj,i = σ{Zj,s : s ≤ i}. Indeed,  m m h n m  o i m ˆ m m E Tj,i+1 Fj,i = Tj,i E exp αθj,i Zj,i /m − H (α) Fj,i m  m  = Tj,i exp H Zj,i /m + j/m, α − H (α) m ≤ Tj,i . Define the stopping time τ m m j = inf{k > 0 : Zj,k ∈ / (−j − 1, 0)}. Then by the Optional Sampling Theorem, for α > 0 n o m m E exp αZj,τ m − τ j H (α) j ≤ eαz < 1. If we can prove that H (α∗ ) < 0 for some α∗ > 0, then Chebyshev’s inequality yields n o n o m −α∗ b ∗ m P Zj,τ m ≥ b ≤ e E exp α Z m j,τ j j ∗ n o ≤ e−α b E exp α∗ Zj,τ m m ∗ m − τ j H (α ) j ∗ ≤ e−α b , m which implies the tightness of {(Zj,τ m ) ∨ 0, j = 1, . . . , m, m ∈ N}. For later use, we j m note that Zj,i − z has the same distribution as the process Zi − (j + x) (equation (3.10)) that appears in the definition of αm j (·|x) when z = x − 1, and so we have −α b∗ αm j ([b, ∞)|x) ≤ e . (4.7) 48 We are now left to show the existence of such α∗ . According to Condition 1.1, Dα H (x, 0) = Eˆθ0,i (x) < 0 for all x ∈ [−1, 1]. Hence for each x there is αx > 0 such that H (x, α) < 0 for all α ∈ (0, αx ), and by the continuity of H we can also take αx to be continuous in x. Since [−1, 1] is compact, inf αx > 0, hence such α∗ exists. m Step 2: tightness of ¯ξ 1 . By assumption, there exists M < ∞ such that ( ) 1 X m−1    sup ER ¯m ¯ ξ kr m ¯ = sup E ¯ m (·) Λm (· h g(Y R Λ ¯ jm ), σ ¯m ≤ M. j j j ) m∈N m∈N m j=0 (4.8) According to the variational formula in Proposition 0.6 or Lemma 0.7(a), for any measurable function ψ bounded from below and for each m ∈ N, j ≤ m Z Z   ¯ E ¯m ψdΛ − log eψ dΛm ¯ Λ ≤ ER ¯ m Λm . (4.9) j j j j Define the measurable function   c yn > b ψ (y) = .  0 otherwise ¯ jm and (4.9) yield Then applying the inequality (4.7) to each component of Y Z   ¯ cP Y¯j,n m ≥b ¯ = E ¯m ψdΛ j    ¯ Λ ≤ ER ¯ m Λm + log 1 + ec e−α∗ b . j j ¯ m ∈ K is possible only if For n ∈ {1, . . . 
, N } and any Borel set K of [0, ∞), X j,n Y¯j,n m 0 0 ∈ K for some n ∈ {1, . . . , N }. Therefore, given any ε > 0 there are b < ∞ and c < ∞ such that 1 X ¯  ¯m 1 X X ¯ ¯m m−1 m−1 N NM N  ∗  P Xj,n ≥ b ≤ P Yj,n ≥ b ≤ + log 1 + ec−α b ≤ ε. m j=0 m j=0 n=1 c c (4.10) 49 m m With ¯ξ 1 playing the role of J m , Lemma 4.3 then yields the tightness of {¯ξ 1 }. 1 Pm−1 Step 3: tightness of rm . If we can show that the measures m j=0 Λm j also tight, then tightness of {rm } will follow. From the definition of Λm j (·|x) in (3.7) we have the upper bound X N Λm j (A1 × · · · × AN |x) ≤ [1[0,1) (xn )αm j (An |xn ) + 1[1,∞) (xn )δ xn −1 (An )]. n=1 −α b ∗ Recall that we have shown αm j ([b, ∞)|x) ≤ e . Given ε > 0, choose b, c < ∞ ∗ satisfying both e−α b < ε and (4.10) and let K = [b, ∞)N . Then ( ) 1 X m−1  ¯ E ¯m Λm Kc |X m j=0 j j ( m−1 N ) 1 XXh    i ≤ E¯ ¯ m αm [b, ∞) X 1[0,1) X ¯ m + 1[1,∞) X ¯ m δ X¯ m −1 ([b, ∞)) j,n j j,n j,n m j=0 n=1 j,n 1 X X ¯  ¯m m−1 N < ε+ P Xj,n ≥ b + 1 m j=0 n=1 < (N + 1)ε. Applying Lemma 4.2, we complete the proof of the tightness of {rm }. m Part 2 (tightness of {(¯ξ , η m j )}). Tightness of r m implies that for each i ∈ N ¯ m (Ki ) ≥ 1 − 2−i for all m. there exists a compact set Ki of X × Y such that Er Suppose we define a by ∞ X a (·) = 1 + i1Kci (·) . i=1 Then Z ∞ X ¯ sup E adrm ≤ 1 + i2−i = M1 < ∞. m∈N i=1 Letting U (·) = log a (·), it follows that U ≥ 0 and U has compact level sets. Thus 50 U is a tightness function such that Z ¯ sup E eU drm ≤ M1 . m∈N Then, again applying the variational formula as in (4.9), Jensen’s inequality, and the uniform bound (4.8), Z   Z  m ¯ sup E U d¯ξ ¯ ≤ M + sup E log e drU m m∈N m∈N Z  ¯ ≤ M + log sup E U e dr m m∈N < ∞. m Lemma 4.1 then implies the tightness of {¯ξ }. Recall that Y = {1, . . . , N }N . Since the measures η¯m take values in the compact η m } is automatic. set P({1, . . . , N } × Y × [0, 1), tightness of {¯ Next we state the main weak convergence result. 
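The inequality (4.9) used repeatedly above is an instance of the Donsker–Varadhan variational formula, $\int \psi\, d\mu - \log \int e^{\psi}\, d\nu \le R(\mu\|\nu)$ for bounded measurable $\psi$. A discrete numerical check (the distributions below are toy examples, not from the thesis):

```python
import numpy as np

def rel_entropy(mu, nu):
    """R(mu || nu) for discrete distributions, in nats."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))

def dv_lower_bound(mu, nu, psi):
    """The left side of (4.9): int psi dmu - log int exp(psi) dnu."""
    mu, nu, psi = (np.asarray(a, float) for a in (mu, nu, psi))
    return float(np.dot(mu, psi) - np.log(np.dot(nu, np.exp(psi))))

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
R = rel_entropy(mu, nu)

rng = np.random.default_rng(2)
for _ in range(5):
    psi = rng.uniform(-2.0, 2.0, size=3)     # arbitrary bounded test function
    assert dv_lower_bound(mu, nu, psi) <= R + 1e-12

# Equality is attained at psi = log(dmu/dnu):
print(np.isclose(dv_lower_bound(mu, nu, np.log(mu / nu)), R))
```

The indicator-type $\psi$ chosen in the proof (value $c$ on a bad set, $0$ elsewhere) is exactly such a bounded test function, which is how the relative entropy bound is converted into a probability bound.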
Recall that for a probability measure such as ¯ξ on a product space, ¯ξ i will indicate the ith marginal. The first two statements are straightforward, and serve to establish notation and decompositions of the limit measures. The last two statements identify key equilibrium properties of the joint distribution of the location and number of the particles that make it to the next threshold. Specifically, equation (4.16) identifies a kind of local (in time or stage) stationary distribution on the particle locations after resampling and in the limit for the controlled IPS system, while (4.17) gives the corresponding distribution on the number of particles that make it to the next level. These relations, which link the equilibrium distributions to the limits of the controls, are the main ingredient in the definition of the local rate function for the IPS process. m  Lemma 4.5. Assume Conditions 1.1 and 1.4. Let ¯ξ , η¯m be the admissible control measures defined by equations (4.1) and (4.2) and let rm and q m be defined by (4.3) 51 ¯ m be the controlled empirical measure defined in (4.6), and and (4.4). Also, let L assume that Z  . 1  m   ¯ M = sup E R ¯ξ (· |t ) kr (· |t ) + R (¯ m m m η (· |t ) kq (· |t))) dt < ∞. (4.11) m∈N 0 m  Then every subsequence of ¯ξ , η¯m , rm , q m , L ¯ m has a subsubsequence that converges  in distribution to a limit ¯ξ, η¯, r, q, L ¯ with the following properties.   ¯ F, 1. There exists a probability space Ω, ¯ such that the limiting quantities ¯ξ, η¯ ¯ P can be realized, respectively, as stochastic kernels on [0, ∞)N ×{∆ ∪ [0, ∞)}N × ¯ Furthermore there are stochastic [0, 1] and {1, . . . , N } × Y × [0, 1] given Ω. kernels ¯ξ (dx × dy |t, ω ) and η¯ (dn × dσ |t, ω ) on [0, ∞)N × {∆ ∪ [0, ∞)}N and ¯ such that {1, . . . , N } × Y × [0, 1] given [0, 1] × Ω, ¯ξ (dx × dy × dt |ω ) = ¯ξ (dx × dy |t, ω ) dt, η¯ (dn × dσ × dt |ω ) = η¯ (dn × dσ |t, ω ) dt. 2. 
Define marginal distributions by ¯ξ (A1 |t, ω ) = ¯ξ(A1 × {∆ ∪ [0, ∞)}N |t, ω ), η¯1 (C1 |t, ω ) = η¯ (C1 × Y |t, ω ) , 1 ¯ξ 2 (A2 |t, ω ) = ¯ξ([0, ∞)N ×A2 |t, ω ), η¯2 (C2 |t, ω ) = η¯ ({1, . . . , N } × C2 |t, ω ) . ¯ (dy |x, t, ω ) on {∆ ∪ [0, ∞)}N given [0, ∞)N × Then there exist stochastic kernels Λ ¯ and Θ [0, 1] × Ω ¯ (dσ |n, t, ω ) on Y given {1, . . . , N } × [0, 1] × Ω, ¯ ¯ such that P-a.s. ¯ for ω ∈ Ω Z Z Z ¯ξ (A1 , A2 |t, ω ) dt = ¯ (A2 |x, t, ω ) ¯ξ 1 (dx |t, ω ) dt, (4.12) Λ ZB ZB ZA1 η¯ (C1 , C2 |t, ω ) dt = ¯ (C2 |n, t, ω ) η¯1 (dn |t, ω ) dt Θ (4.13) B B C1 52 for Borel subsets A1 of [0, ∞)N , A2 of {∆ ∪ [0, ∞)}N , C1 of {1, . . . , N }, C2 of Y, and B of [0, 1]. ¯ 3. Statements analogous to Parts 1 and 2 above hold for r and q, and for P-a.s. ¯ for ω ∈ Ω Z Z r (A1 , A2 , B |ω ) = Λ (A2 |x,t) ¯ξ 1 (dx |t, ω ) dt, (4.14) B A1 where Λ (· |x,t ) = ⊗ni=1 γ (· |xi ,t), and with γ defined by (1.9). Also, q (dn, dσ, dt) has the form Z Z q (C1 , C2 , B |ω ) = Θ (C2 |n) η¯1 (dn |t, ω ) dt, (4.15) B C1 where Θ (· |n) = ⊗ni=1 δ i ⊗N j=n+1 Un is the product measure defined in (3.4), and Ai , Ci and B take the same form as in Part 2. ¯ 4. P-a.s. ¯ we have for ω ∈ Ω, Z t Z tZ ¯ξ 1 (A |s, ω ) ds = ¯ (dσ |gnum (y) , s, ω ) ¯ξ 2 (dy |s, ω ) ds δ {h(g(y),σ)} (A) Θ 0 0 Z t Z tZ η¯1 ({n} |s, ω ) ds = 1{gnum (y)=n} ¯ξ 2 (dy |s, ω ) ds. 0 0 Therefore, a.s. with respect to Lebesgue measure Z ¯ξ 1 (A |s, ω ) = ¯ (dσ |gnum (y) , s, ω ) ¯ξ 2 (dy |s, ω ) (4.16) δ {h(g(y),σ)} (A) Θ Z η¯1 ({n} |s, ω ) = 1{gnum (y)=n} ¯ξ 2 (dy |s, ω ) . (4.17) Furthermore, for any subset C of {1, . . . , N } Z 1 ¯ L(C) = η¯1 (C |t, ω )dt. (4.18) 0 53 Remark 4.6. We suppress ω variable in later discussion from time to time, but keep in mind that the measures are random. m Proof. Part 1. 
Since (¯ξ , η¯m ) → (¯ξ, η¯) in distribution, by the Skorokhod Representa- m tion Theorem, we can assume that w.p.1 (¯ξ , η¯m ) ⇒ (¯ξ, η¯) on some probability space   ¯ F, (Ω, ¯ By definition, ¯ξ m ([0, ∞)N × {∆ ∪ [0, ∞)}N × dt) = ¯ξ m (dt) = λ (dt), ¯ P). 3 where λ is Lebesgue measure on [0, 1]. It follows easily from the convergence that   w.p.1 ¯ξ 3 (dt) = λ (dt). Hence by Theorem 0.9, there exists a stochastic kernel ¯ξ (dx × dy |t, ω ) on [0, ∞)N × {∆ ∪ [0, ∞)}N given [0, 1] × Ω ¯ such that w.p.1 ¯ξ (dx × dy × dt |ω ) = ¯ξ (dx × dy |t, ω ) dt. The decomposition of η¯ is proved in the same fashion. Part 2. The stated decompositions follow from another application of Theorem 0.9. Part 3. If (x, t) → γ(·|x, t) were continuous then (x, t) → Λ (· |x,t ) would also be continuous, and the form of r would follow from the fact that by Lemma 3.2 Λm ¯m ⇒ ¯ξ. Now in fact j (· |x) → Λ (· |x,t) uniformly, and the weak convergence ξ (x, t) → γ(·|x, t) is not continuous, though under the given conditions it is continuous in t and x → γ(·|x, t) is always right continuous and left continuous when x is not of the form (0, 1]∩[Z/q] (see Lemmas 1.5 and 3.2). Thus to obtain the decomposition via ¯ ¯ξ m ({[K c ]N }c ) can be made as Lemma 1.6, it is sufficient to show that under (4.11), E 1 δ small as desired by choosing δ > 0 sufficiently small, where Kδ = ∪qk=1 ([k/q]−δ, [k/q]) and Kδc = [0, ∞)\Kδ . We first claim that it is sufficient to show that given ε > 0, X X N m−1 ¯1 ¯m Λ E j,n (Kδ ) < ε (4.19) m n=1 j=0 ¯ m is the nth marginal of Λ for some δ > 0, where Λ ¯ m . Using the reasoning that lead j,n j 54 to (4.10), we have 1 X X ¯ ¯m m−1 N  ¯ ¯ξ m ({[K c ]N }c ) ≤ E P Xj,n ∈ Kδ 1 δ m j=0 n=1 1 X X ¯ ¯m m−1 N  ≤ N P Yj,n ∈ Kδ m j=0 n=1 X X N m−1 ¯1 = NE ¯ m (Kδ ). Λ j,n m n=1 j=0 ¯ 1 Pm−1 Λ We next show that for each n = 1, . . . , N , E ¯ m (Kδ ) can be made small, m j=0 j,n thus (4.19). Let Λm m j,n (·|x) = 1[0,1) (x)αj (·|x) + 1[1,∞) (x)δ x−1 (·. 
Applying once again Lemma 0.7(a) as in (4.9) and Jensen’s inequality, for any bounded measurable func- tion f (Z ! Z !) 1 X m−1 1 X m−1  ¯ E fd ¯m Λ j,n (·) − log ef d Λm j,n · h g(Y ¯ jm ), σ ¯mj m j=0 m j=0 ( ) 1 X m−1 m  ¯ ≤E ¯m R Λ j (·) Λj · h g(Y ¯ jm ), σ ¯mj m j=0 ≤ M. Assume that given ε, there exist δ > 0 so that X m−1 ¯1 Λm ¯m E j,n (Kδ |Xj,n ) < ε. (4.20) m j=0 Substituting into the last display the function f that equals 0 on Kδc and log (1 + 1/ε) 55 on Kδ , we have ! 1 X ¯m m−1 ¯ E Λ (Kδ ) m j=0 j,n " !#! 1 X m m−1 1 m ≤ ¯ log 1 + 1 M +E Λ Kδ X ¯ j,n log (1 + 1/ε) ε m j=0 j,n 1 ≤ (M + log 2) . log (1 + 1/ε) This is made as small as desired by choosing ε > 0 small. We complete by proving the claim (4.20). Recall that by definition Λm j,n (·|x) = 1[0,1) (x)αm m j (·|x) + 1[1,∞) (x)δ x−1 (·), and that αj has a uniformly bounded density with respect to Lebesgue measure everywhere except (0, 1] ∩ [Z/q] (this follows the same proof as for γ(·|x, t)). Thus by the right continuity of x 7→ αm m j (·|x), αj (Kδ |x) is proportional to δ for all x save those in intervals of the form ([k/q], [k/q] + δ). ¯m However, the only way a particular component of X j can be in this set (at least if δ < 1/q) is if one of its summands in the random walk that ended at that component was not in the support of the discrete measure. When this happens, the distribution of this particular component has a bounded density with respect to Lebesgue mea- sure, and hence the probability that it falls in Kδ is proportional to δ. Therefore, ¯ 1 Pm−1 Λm (Kδ |X E ¯ m ) is itself proportional to δ, which completes the argument. m j=0 j,n j,n The argument for (4.15) is similar but much simpler since the measure is on a discrete space and is omitted. Here we use that at the prelimit q m (dn, dσ, dt) = δ N¯dmte m ηm (dn) Θ (dσ |n) dt = [¯ 1 (dn |t) dt] Θ (dσ |n) . Part 4. For t ∈ [0, 1] define bmtc bmtc ¯m 1 X ¯ m 1 X Jt = δ ¯m m , Lt = δ ¯m . 
m j=1 h(g(Yj ),¯σj ) m j=1 gnum (Yj ) 56 Then for bounded and continuous functions u1 mapping [0, ∞)N into R and u2 mapping {1, . . . , N } into R, by definition Z bmtc Z bmtc/m Z 1 X  m u1 (x) J¯tm (dx) = ¯ m u1 Xj = u1 (x) ¯ξ 1 (dx| s) ds, m j=1 0 Z bmtc Z bmtc/m Z ¯ m (dn) = 1 X ¯ m  u2 (n) L t u2 Nj = u2 (n) η¯m 1 (dn| s) ds. m j=1 0 This implies Z Z   t t J¯tm ¯m (dx) , L (dn) → ¯ξ (dx |s) ds, η¯1 (dn |s ) ds . t 1 0 0 For n ∈ N and j ∈ {0, 1, . . . , m − 1}, we define F¯jm to be the σ-field generated by ¯ m, σ {(Y m ¯m ¯m ¯m k ¯ k ), k ≤ j} and Gj to be that generated by Fj−1 and Yj . Then Z  m  u1 ¯ m h g Yj+1 , σ ¯m ¯ j+1 − u1 (h (g (y) , σ)) Θ ¯m j+1 (dσ)Λj (dy), Z  u2 gnum (Y m ¯ m (dy) ¯ j+1 ) − u2 (gnum (y)) Λ j are martingale differences with respect to F¯jm and G¯jm , respectively. Since Z Z bmtc/m Z m u1 (x) J¯tm (dx) − u1 (h (g (y) , σ)) η¯m (dσ |gnum (y) , s) ¯ξ 2 (dy |s) ds 0 bmtc bmtc−1 Z 1 X ¯j − m  1 X ¯ m (dσ)Λ ¯ m (dy) = u1 X u1 (h (g (y) , σ)) Θ j+1 j m j=1 m j=0 bmtc−1  Z  1 X  ¯ (dσ)Λ ¯ j+1 − u1 (h (g (y) , σ)) Θ m m ¯ (dy) . m = u1 X j+1 j m j=0 Denote the left hand side of the above equalities by Rm . Then Chebyshev’s inequality 57 and a standard conditioning argument yield that for any ε > 0 ¯ {|Rm | ≥ ε} P    1 bmtc−1 X   Z   ¯ = P ¯ m ¯ m ¯ m u1 Xj+1 − u1 (h (g (y) , σ)) Θj+1 (dσ)Λj (dy) ≥ ε  m  j=0     bmtc−1  X Z  2 1¯ 1   ≤ 2E u1 X¯m j+1 − ¯m u1 (h (g (y) , σ)) Θ ¯m j+1 (dσ)Λj (dy)  ε m 2 j=0     bmtc−1  Z 2  1¯ 1 X  = 2E u1 X¯m j+1 − ¯m u1 (h (g (y) , σ)) Θ ¯m j+1 (dσ)Λj (dy) ε  m2 j=0  4 kuk2∞ ≤ . mε2 ¯ m , and so (4.16) and (4.17) follow. Lastly A similar calculation can be done for L t ¯m = L (4.18) follows from the fact that L ¯ m. 1 4.3 Proof of the Lower Bound (2.2) In this section, we introduce some related concepts, identify the rate function and give the prove for the first part to Theorem 2.2, i.e. (2.2). Definition 4.7 (Feasible). Probability measures ξ ∈ P([0, ∞)N × {∆ ∪ [0, ∞)}N ) and η ∈ P({1, . 
. . . , N } × Y) are called feasible if the analogues of (4.16) and (4.17) hold, i.e.,
$$\xi_1(A) = \int \delta_{h(g(y),\sigma)}(A)\, \eta(d\sigma|g_{num}(y))\, \xi_2(dy),$$
$$\eta_1(\{n\}) = \int 1_{\{g_{num}(y)=n\}}\, \xi_2(dy), \tag{4.21}$$
where $\xi_i$ and $\eta_i$ are the marginals and $\eta(d\sigma|n)$ is defined by $\eta(dn \times d\sigma) = \eta_1(dn)\,\eta(d\sigma|n)$.

For $\phi \in \mathcal{P}(\{1,\dots,N\})$, define $\mathcal{A}(\phi) = \{\text{feasible } (\xi,\eta) \text{ such that } \eta_1 = \phi\}$. We now give the form of the local rate function $G^N$ in Theorem 2.2:
$$G^N(t,\phi) = \inf_{(\xi,\eta) \in \mathcal{A}(\phi)} \left[ \int R(\xi(\cdot|x)\,\|\,\Lambda(\cdot|x,t))\, \xi_1(dx) + \int R(\eta(\cdot|n)\,\|\,\Theta(\cdot|n))\, \eta_1(dn) \right], \tag{4.22}$$
where $\Lambda(A_1 \times \cdots \times A_N|x,t) = \otimes_{i=1}^N \gamma(A_i|x_i,t)$.

Let $(\bar{\xi}^m, \bar{\eta}^m)$ give a value that is within $\varepsilon > 0$ of the infimum in (4.5). Tightness of $\{(\bar{\xi}^m, \bar{\eta}^m, r^m, q^m)\}$ has been proved along any sequence for which the relative entropy costs are bounded, and for the purposes of proving a lower bound on $\hat{W}^m_N$ this can be assumed without loss. Tightness of $\{\bar{L}^m\}$ follows from the fact that $\bar{L}^m$ takes values in a compact set. We now evaluate the limit inferior of $\hat{W}^m_N$ along a subsequence of $m \in \mathbb{N}$ for which $(\bar{\xi}^m, \bar{\eta}^m, r^m, q^m, \bar{L}^m) \to (\bar{\xi}, \bar{\eta}, r, q, \bar{L})$. Then
$$\begin{aligned}
\liminf_{m\to\infty} \hat{W}^m_N + \varepsilon
&\ge \liminf_{m\to\infty} \bar{E}\left[ \int_0^1 \left( R(\bar{\xi}^m(\cdot|t)\,\|\,r^m(\cdot|t)) + R(\bar{\eta}^m(\cdot|t)\,\|\,q^m(\cdot|t)) \right) dt + \langle F^N, \bar{L}^m \rangle \right] \\
&= \liminf_{m\to\infty} \bar{E}\left[ R(\bar{\xi}^m\,\|\,r^m) + R(\bar{\eta}^m\,\|\,q^m) + \langle F^N, \bar{L}^m \rangle \right] \\
&\ge \bar{E}\left[ R(\bar{\xi}\,\|\,r) + R(\bar{\eta}\,\|\,q) + \langle F^N, \bar{L} \rangle \right] \\
&= \bar{E}\left[ \int_0^1 \left( R(\bar{\xi}(\cdot|t)\,\|\,r(\cdot|t)) + R(\bar{\eta}(\cdot|t)\,\|\,q(\cdot|t)) \right) dt + \langle F^N, \bar{L} \rangle \right] \\
&\ge \bar{E}\left[ \int_0^1 G^N(t, \bar{\eta}_1(\cdot|t))\, dt + \langle F^N, \bar{L} \rangle \right] \\
&\ge \inf\left\{ \int_0^1 G^N(t,\phi(t))\, dt + \langle F^N, \nu \rangle : \phi(\cdot) \in \mathcal{P}(\{1,\dots,N\}),\ \nu = \int_0^1 \phi(s)\, ds \right\}.
\end{aligned}$$
The first line of this display is formula (4.5). The second line follows from the chain rule, Lemma 0.7(f). The third line is a consequence of the convergence in distribution of $(\bar{\xi}^m, \bar{\eta}^m, r^m, q^m, \bar{L}^m) \to (\bar{\xi}, \bar{\eta}, r, q, \bar{L})$, the nonnegativity and lower semicontinuity of relative entropy, Fatou's Lemma, and the non-negativity of $F^N$.
The fourth line is again the chain rule, and the fifth is due to the almost sure feasibility implied by (4.16) and (4.17) and the definition of $G^N$. The final line follows from (4.18).

Note that the proof of the Laplace upper bound naturally provides a candidate for the rate function, which is defined for $\nu \in \mathcal{P}(\{1,\dots,N\})$ by
$$I_N(\nu) = \inf\left\{ \int_0^1 G^N(t,\phi(t))\, dt : \nu = \int_0^1 \phi(s)\, ds \right\},$$
with $G^N(t,\phi)$ defined as in (4.22). Since $\varepsilon > 0$ is arbitrary, we have proved that every subsequence of the original sequence $\{\hat{W}^m_N, m \in \mathbb{N}\}$ has a subsubsequence satisfying
$$\liminf_{m\to\infty} \hat{W}^m_N \ge \inf_{\nu \in \mathcal{P}(\{1,\dots,N\})} \left[ I_N(\nu) + \langle F^N, \nu \rangle \right].$$
An argument by contradiction applied to an arbitrary subsequence shows that the entire sequence $\{\hat{W}^m_N, m \in \mathbb{N}\}$ satisfies this same lower limit.

Chapter 5

The Large Deviation Lower Bound

In this chapter we prove (2.3), which corresponds to a large deviation lower bound. To prove this bound, one starts with a near minimizer of the right-hand side and, based on it, constructs a control that can be applied in the prelimit stochastic control representation. A key qualitative property used to prove the convergence of the costs is a form of ergodicity. As we will see, the nonstandard features of the problem that made the proof of the large deviation upper bound difficult complicate the analysis here as well. In particular, each of the following properties requires the introduction of new techniques and methods of analysis: the transition kernel on thresholds is in general not Feller; the transition kernels depend both on the large deviation parameter $m$ and on the threshold index; and the underlying process includes the absorbing state $\Delta$. We begin by discussing the ergodicity properties of the limit transition kernels.
5.1 Some Important Assumptions and Ergodicity of the Limit Underlying Transition Kernel

To establish the large deviation lower bound, we need to introduce additional conditions on the underlying measures $\mu(dy|t)$ and the limit transition kernel $\gamma$ defined in (1.9). Note that since $\gamma$ contains a cemetery state $\Delta$, the relevant ergodicity properties are with respect to the kernel that excludes the absorbing set. Note also that under Conditions 1.1, 1.3 and 1.4, $\gamma(\Delta|x,t)$ is bounded away from 0 and 1, uniformly in $t \in [0,1]$ and $x \in [0,1)$. Define a stochastic kernel $\tilde{\gamma}$ on $\mathcal{P}([0,\infty))$ given $[0,\infty) \times [0,1]$ by
$$\tilde{\gamma}(A|x,t) = \frac{\gamma(A|x,t)}{1 - \gamma(\Delta|x,t)}$$
for Borel sets $A \subset [0,\infty)$.

Some type of transitivity or communication condition of the following sort is standard in the large deviation theory for the empirical measures of a Markov chain. Note that the assumption is made for each $t$ due to the nonstationarity of the process, but under Condition 1.4 the condition holds for one $t$ if and only if it holds for all $t \in [0,1]$. Also, we can restrict to initial conditions $x \in [0,1)$, since for initial conditions $x \ge 1$ the dynamics of the chain simply move the process to $x-1$ with probability one.

Condition 5.1 (Transitivity). For each $t \in [0,1]$, there exist positive integers $l_0$ and $n_0$ such that for all $x, \zeta \in [0,1)$
$$\sum_{i=l_0}^{\infty} \frac{1}{2^i}\, \tilde{\gamma}^{(i)}(dy|x,t) \ll \sum_{j=n_0}^{\infty} \frac{1}{2^j}\, \tilde{\gamma}^{(j)}(dy|\zeta,t).$$
Here, $\tilde{\gamma}^{(k)}$ is the $k$-step transition probability defined recursively by
$$\tilde{\gamma}^{(k+1)}(A|x,t) = \int \tilde{\gamma}(A|y,t)\, \tilde{\gamma}^{(k)}(dy|x,t)$$
for Borel sets $A$, with $\tilde{\gamma}^{(1)} = \tilde{\gamma}$.

An immediate consequence of Condition 5.1 is the indecomposability [1, Section 7.3] of $\tilde{\gamma}$, which is shown in the next proposition. The idea of the proof will be used again in the next section when we show that the IPS transition kernel, involving $N$ particles, also satisfies a "transitivity" condition.

Proposition 5.2. Assume Condition 5.1.
Then for each $t \in [0,1]$, $\tilde{\gamma}$ is indecomposable; i.e., there do not exist disjoint Borel subsets $A_1$ and $A_2$ of $[0,\infty)$ such that
$$\tilde{\gamma}(A_1|x,t) = 1 \text{ for all } x \in A_1 \quad \text{and} \quad \tilde{\gamma}(A_2|\zeta,t) = 1 \text{ for all } \zeta \in A_2. \tag{5.1}$$

Proof. We suppress $t$ in the notation since it plays no role in the proof. Assume (5.1) holds. Then because the dynamics shift $x$ to $x-1$ when $x \ge 1$, there exist $x \in [0,1) \cap A_1$ and $\zeta \in [0,1) \cap A_2$ such that $\tilde{\gamma}(A_1|x) = 1$ and $\tilde{\gamma}(A_2|\zeta) = 1$. It then follows from (5.1) that for any $l$ and $n$ in $\mathbb{N}$ we have $\tilde{\gamma}^{(l)}(A_1|x) = 1$ and $\tilde{\gamma}^{(n)}(A_2|\zeta) = 1$. According to Condition 5.1, the first of these equalities with $l = l_0$ guarantees that there exists $j \ge n_0$ such that $\tilde{\gamma}^{(j)}(A_1|\zeta) > 0$. Since $A_1$ and $A_2$ are disjoint, for $\zeta \in A_2$ this inequality is incompatible with the equality $\tilde{\gamma}^{(n)}(A_2|\zeta) = 1$ for any $n \in \mathbb{N}$. Hence the indecomposability of $\tilde{\gamma}(\cdot|x)$ follows.

In the proof of the lower bound, ergodicity will be used to derive various limits. However, owing to the nonstationarity there will be contributions from several different Markov processes, with the final state of one serving as the initial state of the next. For this reason, we need the following type of condition, which ensures that the bounds coming from each process are independent of the starting state. See the discussion in [9, Section 8.4].

Condition 5.3 (Uniformity). For any $t \in [0,1]$ and $x \in [0,1)$ there exist a probability measure $\beta$ on $[0,\infty)$, $k \in \mathbb{N}$, and $c > 0$ such that for all $\zeta \in [0,1)$ satisfying $|x - \zeta| < c$ and for all Borel sets $A$
$$\sum_{i=1}^{k} \tilde{\gamma}^{(i)}(A|\zeta,t) \ge c\,\beta(A).$$

5.2 The Construction of Ergodic IPS Transition Kernels

As mentioned in the previous section, the underlying transition kernel $\gamma$ contains a cemetery state. In order to introduce additional ergodicity properties, we focus our discussion on $\tilde{\gamma}$, the version of $\gamma$ conditioned to avoid the cemetery state $\Delta$.
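For a finite-state caricature (the transition matrix below is invented purely for illustration), conditioning out the cemetery state is just a rowwise renormalization, exactly as in the definition of $\tilde{\gamma}$:

```python
import numpy as np

# An invented finite-state caricature of gamma: three "live" states plus a
# cemetery column Delta (last entry).  Rows sum to one.
gamma = np.array([
    [0.30, 0.20, 0.10, 0.40],   # gamma(. | state 0); 0.40 = gamma(Delta | 0)
    [0.10, 0.40, 0.20, 0.30],
    [0.05, 0.15, 0.60, 0.20],
])

def condition_out_cemetery(kernel):
    """gamma~(.|x) = gamma(.|x) / (1 - gamma(Delta|x)), rowwise."""
    survive = kernel[:, :-1]
    return survive / survive.sum(axis=1, keepdims=True)

gamma_tilde = condition_out_cemetery(gamma)
assert np.allclose(gamma_tilde.sum(axis=1), 1.0)   # again a stochastic kernel
print(gamma_tilde[0])
```

The renormalization is well defined precisely because $\gamma(\Delta|x,t)$ is bounded away from 1, so each row retains positive surviving mass.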
In this section, we move on to a family of IPS-type processes which share the same transportation measure, related to $\gamma$, but have different resampling mechanisms. Again, such processes contain a cemetery state $\{\Delta\}^N$. However, under the assumptions on $\tilde{\gamma}$ from the previous section, the renormalized versions of these IPS-type processes also satisfy transitivity and uniformity conditions. Furthermore, such processes are geometrically ergodic.

Let $T_1 = [0,\infty)^N$ and $T_2 = \{\Delta \cup [0,\infty)\}^N$. Recall that for any admissible control with finite cost, the transportation control must be supported on $T_2 \setminus \{\Delta\}^N$; i.e., with probability one at least one particle makes it to the next threshold. Hence it is relevant to study the renormalized transportation measure on $T_2 \setminus \{\Delta\}^N$ defined by
$$\tilde{\Lambda}(\cdot|x,t) = \frac{\Lambda(\cdot|x,t)}{1 - \Lambda(\{\Delta\}^N|x,t)} = \frac{\otimes_{i=1}^N \gamma(\cdot|x_i,t)}{1 - \otimes_{i=1}^N \gamma(\Delta|x_i,t)}. \tag{5.2}$$
Denote $c(x,t) = 1 - \otimes_{i=1}^N \gamma(\Delta|x_i,t)$. Since $\gamma(\Delta|x,t)$ is bounded away from 0 and 1 uniformly in $t \in [0,1]$ and $x \in [0,1)$, $c(x,t)$ is bounded away from 0 and 1 uniformly in $t \in [0,1]$ and $x \in T_1 \setminus [1,\infty)^N$. We now consider $\tilde{\Lambda}$ together with any resampling measure $\tilde{\Theta}$ that satisfies $\tilde{\Theta} \ll \Theta$, and denote the associated transition kernel on $T_1$ given $T_1$ by $\tilde{p}(x,\cdot|t)$, i.e.,
$$\tilde{p}(x,\cdot|t) = \int_{T_2 \times \mathcal{Y}} \delta_{h(g(y),\sigma)}(\cdot)\, \tilde{\Theta}(d\sigma|g_{num}(y))\, \tilde{\Lambda}(dy|x,t). \tag{5.3}$$
Note the ambiguity of the notation: $\tilde{p}$ can be associated with a number of resampling measures, but it always uses the transportation measure $\tilde{\Lambda}$.

Proposition 5.4. Consider any resampling measure $\tilde{\Theta}$ such that $\tilde{\Theta} \ll \Theta$, and assume Conditions 1.1, 1.2 and 1.4. Then for each $t \in [0,1]$, $\tilde{p}(x,\cdot|t)$ has an invariant measure.

Proof. The main difficulty is due to the fact that the process does not satisfy the Feller property. For $m \in \mathbb{N}$, $j = 0,\dots,m-1$, and $\tilde{X}^m_0 = 0$, define random variables $\tilde{X}^m_0, \tilde{Y}^m_1, \tilde{X}^m_1, \tilde{Y}^m_2, \dots$
recursively via the requirements n o   ˜ m ˜m ˜ ˜ m ˜ m P Yj+1 ∈ · Xj = x = Λ (· |x,t) , Nj+1 = gnum Yj+1 n o   ˜m ˜ ˜ ˜ P σm j+1 ∈ · Nj+1 = n = m m m Θ (· |n, t) , Xj+1 = h g(Yj+1 ), σ j+1 . Let 1 X m J˜m = δ ˜m, m j=1 Xj and denote X m ˜ξ m (dx,dy) = 1 ˜ (dy |x,t) . δ ˜ m (dx) × Λ m j=1 Xj Then for any bounded and continuous function u : T1 → R, Z Z 1 X  ˜ m m m ˜m u (x) J (dx) = u Xj = u1 (x) ˜ξ 1 (dx) . m j=1 ˜ (·|x, t) kΛ(· |x,t)) = − log c (x, t) < ∞ for all x and Note that by construction R(Λ ˜  Θ implies R(Θ(·|n, t, and Θ ˜ ˜ and Θ are probability t) kΘ(· |n)) < ∞ since Θ m m measures on a finite set. Thus by Proposition 4.4 {˜ξ } is tight. Let ˜ξ converge in distribution to a limit ˜ξ along a subsequence. Along that same subsequence obviously J˜m (dx) → ˜ξ 1 (dx) . We can then use the martingale difference argument and Condition 1.4 to deal with 65 the places where the Feller property fails just as in Part 4 of Lemma 4.5, and obtain Z Z Z u (x) ˜ξ 1 (dx) = ˜ (dy |x,t) ˜ξ 1 (dx) ˜ (dσ |gnum (y)) Λ u(h (g (y) , σ))Θ w.p.1. We now use the facts that there is a countable set of measure determining functions u to conclude that w.p.1 Z ˜ξ 1 (A) = p˜ (x, A |t ) ˜ξ 1 (dx) for all Borel sets A, i.e., ˜ξ 1 is an invariant measure of p˜ (x, · |t). For t ∈ [0, 1], let π ˜ (dx |t) denote any of the invariant distributions of p˜(x, · |t ) just constructed. The next lemma shows that there is in fact only one such invariant probability and that the associated Markov process is geometrically ergodic, as well as other properties that will be needed later. ˜ such that Θ Lemma 5.5. Consider any resampling measure Θ ˜  Θ, and assume Conditions 1.1, 1.2, 1.4, 5.1 and 5.3. The following conclusion hold for each t ∈ [0, 1]. 1. The Markov process associated with the transition kernel p˜(x, · |t) and initial ˜ (· |t ) is ergodic. distribution π 2. 
The transition kernel $\tilde{p}$ satisfies a transitivity condition of the same form as $\tilde{\gamma}$: there exist positive integers $n_0$ and $l_0$ such that
$$\sum_{i=l_0}^{\infty} \frac{1}{2^i}\, \tilde{p}^{(i)}(x, dx'|t) \ll \sum_{j=n_0}^{\infty} \frac{1}{2^j}\, \tilde{p}^{(j)}(\zeta, d\zeta'|t)$$
for $x, \zeta \in [0,1)^N$.

3. The transition kernel $\tilde{p}(x,\cdot|t)$ satisfies the following uniformity condition. Given $x \in [0,1)^N$ there exist a probability measure $\tilde{\beta}$ on $[0,\infty)^N$, $\tilde{k} \in \mathbb{N}$ and $\tilde{c} > 0$ such that for all $\zeta \in [0,1)^N$ satisfying $\|\zeta - x\| \le \tilde{c}$ and for all Borel sets $A$,
$$\sum_{i=1}^{\tilde{k}} \tilde{p}^{(i)}(\zeta, A|t) \ge \tilde{c}\,\tilde{\beta}(A).$$
In particular, the precompact set $[0,1)^N$ is a small set [21, (5.14)] for the Markov process associated with the transition kernel $\tilde{p}$.

4. The Markov process associated with the transition kernel $\tilde{p}$ is geometrically ergodic.

5. Let $A$ be a Borel set of $T_1$ such that $\tilde{p}^{(l_0)}(x_0, A|t) > 0$ for some $x_0 \in T_1$ and $l_0 \in \mathbb{N}$. Then $\tilde{\pi}(A|t) > 0$.

Proof. To simplify the notation we omit $t$.

Part 1. We first show that $\tilde{p}(x,\cdot)$ is indecomposable. Ergodicity then follows from [9, Theorem A.4.5]. The proof is by contradiction. Suppose that there exist disjoint sets $A_i \subset T_1$ such that $\tilde{p}(x_i, A_i) = 1$ for all $x_i \in A_i$, $i = 1,2$. Define
$$B_i = \bigcup_{x_i \in A_i} \bigcup_{n=1}^{N} \{x_n : x_i = (x_1, \dots, x_N)\},$$
i.e., $B_i$ is the collection of all possible individual particle locations for joint $N$-particle locations $x_i$ in $A_i$. Note that $A_i \subset \{B_i\}^N$, and therefore $\tilde{p}(x_i, \{B_i\}^N) = 1$ for all $x_i \in A_i$. Since $\gamma(\Delta|x_i) < 1$ for all $x_i \in B_i$, we must have $\gamma(B_i|x_i) > 0$ for all $x_i \in B_i$.

First, we claim that $\tilde{\gamma}(B_i|x_i) = 1$ for all $x_i \in B_i$, $i = 1,2$. If not, then without loss of generality we may assume $i = 1$ and that there exist $z \in B_1$ and $C_1 \subset [0,\infty)$ with $C_1 \cap B_1 = \emptyset$ such that $\gamma(C_1|z) > 0$. Let $\bar{x} \in A_1$ have $z$ as its $n$th component, and consider the set
$$\mathbf{C}_1 = \{B_1\}^{n-1} \otimes C_1 \otimes \{B_1\}^{N-n} \subset T_1.$$
By construction $\mathbf{C}_1 \cap A_1 = \emptyset$, but
$$\tilde{p}(\bar{x}, \mathbf{C}_1) \ge \int_{\mathbf{C}_1 \times \mathcal{Y}} \delta_{h(g(y),\sigma)}(\mathbf{C}_1)\, \tilde{\Theta}(d\sigma|g_{num}(y))\, \tilde{\Lambda}(dy|\bar{x}) \ge \gamma(C_1|z) \prod_{j=1, j \ne n}^{N} \gamma(B_1|\bar{x}_j) > 0,$$
which is a contradiction. Hence $\tilde{\gamma}(B_i|x) = 1$ for all $x \in B_i$.
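As an aside, the kernel $\tilde{p}$ of (5.3) is straightforward to simulate, which may help fix ideas. The sketch below is a toy stand-in (Gaussian transport with an independent killing probability replaces $\gamma$, and uniform resampling of dead slots plays the role of the resampling measure); the rejection loop implements the conditioning on $T_2 \setminus \{\Delta\}^N$, i.e., the $1/c(x,t)$ renormalization in (5.2):

```python
import numpy as np

rng = np.random.default_rng(3)

def transport(x, kill_prob, step_mean=-0.3, step_sd=1.0):
    """Toy stand-in for the transportation measure: each particle takes a
    Gaussian step and, independently, is killed (sent to Delta, encoded
    here as np.nan) with probability kill_prob."""
    y = x + rng.normal(step_mean, step_sd, size=x.shape)
    y[rng.random(x.shape) < kill_prob] = np.nan
    return y

def ips_step(x, kill_prob=0.3):
    """One step of a p~-type kernel: transport conditioned on at least one
    survivor (the rejection loop realizes the renormalization in (5.2)),
    then dead slots are refilled by uniform resampling from survivors."""
    while True:
        y = transport(x, kill_prob)
        alive = ~np.isnan(y)
        if alive.any():                    # condition on avoiding {Delta}^N
            break
    y[~alive] = rng.choice(y[alive], size=int((~alive).sum()))
    return y

x = np.zeros(5)
for _ in range(10):
    x = ips_step(x)
print(np.isnan(x).any())   # after resampling no particle remains at Delta
```

After each step every slot holds a surviving location, reflecting the fact that $\tilde{p}$ is a genuine stochastic kernel on $T_1$.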
With this property established, if $B_1$ and $B_2$ were disjoint then we would have a contradiction to Proposition 5.2. We claim even more: $[0,1) \cap B_i$, $i = 1,2$, are not disjoint. Indeed, if $x \in B_1 \cap B_2$, then it is automatic that $x - \lfloor x \rfloor \in [0,1) \cap B_1 \cap B_2$. Thus we may suppose that there exists $x_0 \in B_1 \cap B_2 \cap [0,1)$ which, without loss of generality, is the first component of some $x_i \in A_i$ for $i = 1,2$. Let
$$k = \max_{i,n} \left\{ \lceil x_{i,n} \rceil : x_i = (x_{i,1},\dots,x_{i,N}) \right\}. \tag{5.4}$$
We now construct an event $E$ which satisfies $\tilde{p}^{(k)}(x_i, E) > 0$ for $i = 1,2$. An event on path space will be constructed, which will imply the existence of an event of the desired form. We start at time 0 with particles in locations $x_i$, $i = 1,2$. At each time $\ell = 0,\dots,k-1$, we require that any particle that is a descendant of the particle originally in the first location be returned to $[0,\infty)$, i.e., not sent to $\Delta$. This can always be done with uniformly positive probability. All other particles are dealt with as follows:

• If a particle is in $[0,1)$ but is not a descendant, then it moves to the state $\Delta$.

• If a particle is at $x \in [1,\infty)$, then it is moved to $x-1$.

Again both events can be realized with positive probability. The definition of $k$ in (5.4) guarantees that at the end of $k$ steps, all descendants other than those originating from $x_0$ will have been sent to the cemetery state $\Delta$. In other words, we have identified an event with strictly positive probability defined in terms of the descendants of the particle originally at $x_0$, all of which have exactly the same distribution under the two different initial conditions, and hence such a set $E$ exists.

Part 2. We now prove that $\tilde{p}(x,dy)$ also satisfies a transitivity condition with the same $l_0$ as in Condition 5.1. That is, we need to show that there exists a fixed $\bar{n}_0 < \infty$ such that, given any $x, \zeta \in [0,1)^N$ and $A \subset T_1$ with $\tilde{p}^{(i)}(x,A) > 0$ for some $i \ge l_0$, there is $j \ge \bar{n}_0$ such that $\tilde{p}^{(j)}(\zeta,A) > 0$.
Without loss of generality, we can assume $A \subset T_1$ is a rectangle, i.e., $A = A_1 \times \cdots \times A_N$. In this case, for each $A_n$ there exists $y_n \in \{x_1, \ldots, x_N\}$ such that $\tilde\gamma^{(i)}(A_n \,|\, y_n) > 0$ for the same $i \ge l_0$. Condition 5.1 implies that for $\zeta_n \in [0,1)$ there exists $j_n \ge n_0$ such that $\tilde\gamma^{(j_n)}(A_n \,|\, \zeta_n) > 0$. The problem now becomes finding a common $j$ for all $\zeta_n$.

Using the same occupation measure argument as in Proposition 5.4, one can construct an invariant distribution for $\tilde\gamma$. Let $\tilde\kappa$ denote this measure. We claim that if $\tilde\gamma^{(i)}(A \,|\, x) > 0$ for some $i \ge l_0$, $x \in [0,\infty)$ and Borel set $A$, then $\tilde\kappa(A) > 0$. This is due to the following standard argument. Define the function $\iota$ mapping $[0,1)$ into $\mathbb{N}$ by
\[
\iota(\zeta) = \min\left\{j \in \mathbb{N} : j \ge n_0 \text{ and } \tilde\gamma^{(j)}(A \,|\, \zeta) > 0\right\}.
\]
Condition 5.1 guarantees that $\iota(\zeta)$ is finite for all $\zeta$. Furthermore, $\iota(\zeta)$ is measurable. $[0,1)$ can be written as a disjoint union of Borel sets $\Sigma^{(j)}$, where $\Sigma^{(j)} = \{\zeta : \iota(\zeta) = j\}$. It must be true that $\tilde\kappa(\Sigma^{(j)}) > 0$ for some $j \in \mathbb{N}$ satisfying $j \ge n_0$. The definition of $\Sigma^{(j)}$ implies that $\tilde\gamma^{(j)}(A \,|\, \zeta) > 0$ for all $\zeta \in \Sigma^{(j)}$. Hence,
\[
\tilde\kappa(A) = \int \tilde\gamma^{(j)}(A \,|\, \zeta)\, \tilde\kappa(d\zeta) \ge \int_{\Sigma^{(j)}} \tilde\gamma^{(j)}(A \,|\, \zeta)\, \tilde\kappa(d\zeta) > 0.
\]
This proves the claim.

Continuing with the same set $A$, we note that since $\tilde\gamma$ possesses an invariant probability measure and satisfies the transitivity condition, the corresponding Markov chain is ergodic. It follows from the Ergodic Theorem that
\[
\int_A \frac{1}{n} \sum_{i=1}^{n} \tilde\gamma^{(i)}(A \,|\, \zeta)\, \tilde\kappa(d\zeta) \to \tilde\kappa(A) > 0,
\]
and hence there is $k \in \mathbb{N}$ such that
\[
\int_A \tilde\gamma^{(k)}(A \,|\, \zeta)\, \tilde\kappa(d\zeta) > 0. \tag{5.5}
\]
We now return to the proof of the transitivity. According to the previous paragraph, for each $n$ there is $k_n \in \mathbb{N}$ such that (5.5) holds with $k$ and $A$ replaced by $k_n$ and $A_n$. Also, we can find $\{c_1, \ldots, c_N\} \in \mathbb{N}^N$ satisfying
\[
j_1 + c_1 k_1 = \cdots = j_N + c_N k_N \ge n_0.
\]
It then follows that with $\bar n_0$ equal to the common value $j_n + c_n k_n$,
\[
\tilde p^{(\bar n_0)}(\zeta, A) \ge \prod_{n=1}^{N} \int_{A_n} \cdots \int_{A_n} \tilde\gamma^{(j_n)}(dz_0 \,|\, \zeta_n)\, \tilde\gamma^{(k_n)}(dz_1 \,|\, z_0) \cdots \tilde\gamma^{(k_n)}\bigl(dz_{c_n} \,\big|\, z_{c_n - 1}\bigr) > 0.
\]

Part 3. Let $d = \sup_{x \in [0,1),\, t \in [0,1]} \gamma(\Delta \,|\, x, t) < 1$. Without loss of generality, assume that $A$ is a rectangle, i.e., $A = A_1 \times \cdots \times A_N$. Then
\[
\tilde p^{(i)}(\zeta, A) \ge (1 - d)^{iN} \prod_{n=1}^{N} \tilde\gamma^{(i)}(A_n \,|\, \zeta_n).
\]
Let $\tilde k = k$, $\tilde c \in (0, c(1-d)^{kN})$ and $\tilde\beta$ be the product of $\beta$. The uniformity property then follows for small enough but positive $\tilde c$ by Condition 5.3.

Part 4. It is easy to check that $\tilde p$ is $\psi$-irreducible [21, Proposition 4.2.2] by the uniformity property proved in Part 3. Furthermore, $[0,1)^N$ is a small set of the process. Geometric ergodicity then follows from [21, Theorem 15.0.1].

Part 5. The proof is essentially the same as the argument used in Part 2 to show that $\tilde\gamma^{(i)}(A \,|\, x) > 0$ implies $\tilde\kappa(A) > 0$. See also [9, Lemma 8.6.2].

5.3 The Construction of Ergodic Controls

In this section we show how to construct, from the renormalized underlying transition kernel $\tilde p$, controls that inherit the ergodicity property (Lemma 5.6), and how to reconstruct a (state-dependent) random walk from a given transition kernel of a single particle (Lemma 5.7).

The following lemma tells us that given any target distribution on the number of successful particles, we can find a control with cost close to the optimum and for which the associated Markov process is geometrically ergodic and the transportation cost is bounded as a function of $x$. Let
\[
\tilde\phi(\{n\} \,|\, t) = \int 1_{\{g_{num}(y) = n\}}\, \tilde\Lambda(dy \,|\, x, t)\, \tilde\pi(dx \,|\, t), \tag{5.6}
\]
the distribution of the number of successful particles associated with the renormalized measures of the underlying process model and a resampling distribution $\tilde\Theta \ll \Theta$, i.e., $(\tilde\Lambda, \tilde\Theta, \tilde\pi)$. In the lemma $\|\cdot\|_v$ denotes the total variation norm. Again we suppress the $t$ variable in the argument.

Lemma 5.6.
Assume Conditions 1.1, 1.3, 1.4, 5.1 and 5.3 and consider any ϕ ∈ P {1, · · · , N } satisfying GN (t, ϕ) < ∞, where GN is the local rate function defined in (4.22). 1. For (ξ, η) ∈ A (ϕ) suppose that R (ξ kξ 1 ⊗ Λ) + R (η kϕ ⊗ Θ) < ∞, and recall that by the definition of feasibility ξ 1 = [ξ]1 is an invariant distribu- ˜ define Θ tion of the associated transition kernel. Let η = ϕ ⊗ Θ ˜ and let π ˜ be ˜ Θ). invariant for (Λ, ˜ Then ξ 1  π ˜. ¯ ∈ P{1, . . . , N } such that kφ 2. Given δ > 0 there exists φ ¯ − ϕkv ≤ δ, and fur- thermore there exists (¯ξ, η¯) ∈ A(φ) ¯ such that the associated controlled Markov ¯ j is geometrically ergodic, and sequence X    ¯ ≤ R ¯ξ ¯ξ ⊗ Λ + R η¯ ¯ GN t, φ φ ⊗ Θ ≤ GN (t, ϕ) + δ. (5.7) 1 k 3. There exists a sequence of admissible controls (¯ξ , η¯) such that the associ- ated transition kernels are geometrically ergodic and the transportation cost ¯ k (·|x)kΛ(·|x)) is bounded as a function of x. Furthermore, R(Λ  ¯ k (· |x) kΛ (· |x)) → R Λ R(Λ ¯ (· |x) kΛ (· |x) (5.8) k k k and k¯ξ 1 − ¯ξ 1 kv → 0, where ¯ξ = ¯ξ 1 ⊗ Λ ¯ k , ¯ξ = ¯ξ 1 ⊗ Λ ¯ We place the proof for Part 3 of the Lemma in the appendix at the end of this chapter, since it requires a proposition that will be stated later in the next section. ¯ by the decompositions ξ = ξ 1 ⊗ Λ Proof. Part 1. Define Λ ¯ and Θ ˜ defined as in the ¯ Θ) statement. Let p¯(x, ·) and p˜(x, ·) be the transition kernels associated with (Λ, ˜ 72 ˜ Θ) and (Λ, ˜ according to the analogues of (5.3). 
Then applying Part 4 of Lemma 4.5 and Lemma 0.7(g) to get the first inequality Z p (x, ·) k˜ R (¯ p (x, ·)) ξ 1 (dx) T1 Z Z = ˜ δ h(g(y),σ) (dx0 ) Θ(dσ ¯ |gnum (y))Λ(dy |x) T1 T1 ×Y×{T2 \{∆N }} R # d δ h(g(y),σ) (·) ˜ Θ(dσ |gnum (y)) ¯ Λ(dy |x) Y×{T2 \{∆N }} × log R (x0 ) ξ 1 (dx) ˜ ˜ d Y×{T2 \{∆N }} δ h(g(y),σ) (·) Θ(dσ |gnum (y))Λ(dy |x) Z Z ¯ |x)  ¯ dΛ(· ≤ Λ(dy |x) log (y) ξ 1 (dx) T1 T2 \{∆N } ˜ |x) dΛ(· Z Z ¯ |x)  ¯ d Λ(· = Λ(dy |x) log (y) ξ 1 (dx) T1 T2 \{∆N } dΛ(· |x) Z Z  ¯ dΛ(· |x) + Λ(dy |x) log (y) ξ 1 (dx) T1 T2 \{∆N } ˜ |x) dΛ(· Z Z  = ¯ R Λ(· |x) kΛ(· |x) ξ 1 (dx) + log c (x) ξ 1 (dx) T1 T1 < ∞. Hence the set C = {x ∈ T1 : p¯ (x, ·)  p˜ (x, ·)} satisfies ξ 1 (C) = 1. If we redefine p¯ (x, ·) to equal p˜ (x, ·) for x ∈ Cc , then ξ 1 is also an invariant measure of the new transition kernel. Let p¯ also denote this redefined kernel. One can show by induction that p¯(l0 ) (x, ·)  p˜(l0 ) (x, ·) for all x ∈ C, where l0 is the number appearing in Lemma 5.5. Therefore for any Borel set A with ξ 1 (A) > 0, iterating ξ 1 p¯ = ξ 1 yields Z p¯(l0 ) (x, A) ξ 1 (dx) = ξ 1 (A) > 0. Hence there exists a Borel set B such that ξ 1 (B) > 0 and p¯(l0 ) (x, A) > 0 for all 73 x ∈ B. It follows that p˜(l0 ) (x0 , A) > 0 for some x0 ∈ B ∩ C. By part 5 of the ˜ (A) > 0, and therefore ξ 1  π previous lemma π ˜. ˜ by (5.6). For a ∈ (0, 1), the probability measure ˜ define φ Part 2. For the given Θ ˜ satisfies kϕa − ϕk = akϕ − φk ϕa = (1 − a)ϕ + aφ ˜ v ≤ 2a. Hence to satisfy the first v ¯ = ϕa for any a ∈ (0, δ/2]. Next define probability measures for claim we can take φ a ∈ (0, δ/2] by ¯ξ = (1 − a) ξ + a˜ π ⊗ Λ, ˜ ⊗ Θ. ˜ and η¯ = (1 − a) η + aφ ˜ (5.9) Then ¯ξ 1 = (1 − a) ξ 1 + a˜ ¯ ⊗ Θ. π and η¯ = φ ˜ We first claim that (¯ξ, η¯) ∈ A(φ), ¯ i.e., (¯ξ, η¯) is feasible. 
Indeed, Z δ h(g(y),σ) (A) Θ ˜ (dσ |gnum (y)) ¯ξ 2 (dy) Z n h io = ˜ (dσ |gnum (y)) (1 − a) ξ 2 + a π δ h(g(y),σ) (A) Θ ˜ ˜⊗Λ 2 = [(1 − a) ξ 1 + a˜ π ] (A) = ¯ξ 1 (A) and Z Z n h io 1{gnum (y)=n} ¯ξ 2 (dy) = 1{gnum (y)=n} (1 − a) ξ 2 + a π ˜ ˜⊗Λ 2 h i = (1 − a) ϕ + aφ ˜ ({n}) ¯ ({n}) . = φ Let q and p¯ be the transition kernels defined as in (5.3) associated with the transportation and the resampling controls induced by (ξ, η) and (¯ξ, η¯) (note that p¯ is based on ¯ξ defined in (5.9) and is not the same as in Part 1). Then ξ 1 and ¯ξ 1 are 74 invariant measures of q and p¯, respectively. Furthermore, ¯ξ ⊗ p¯ = (1 − a) ξ ⊗ q + a˜ π ⊗ p˜. (5.10) 1 1 We next verify (5.7). By the convexity of relative entropy    ¯ ≤ R ¯ξ ¯ξ ⊗ Λ + R η¯ ¯ GN t, φ φ ⊗ Θ 1   = R (1 − a) ξ + a˜ π⊗Λ ˜ k(1 − a) ξ 1 ⊗ Λ + a˜ π⊗Λ   ˜⊗Θ + R (1 − a) η + aφ ˜ (1 − a)ϕ ⊗ Θ + aφ ˜⊗Θ ≤ (1 − a) [R (ξ kξ 1 ⊗ Λ) + R (η kϕ ⊗ Θ)] h    i +a R π ˜⊗Λ ˜ k˜π⊗Λ +R φ ˜⊗Θ˜ ˜ φ ⊗ Θ . ˜  Θ, and finally the Using the facts that Θ is a measure on a finite space, that Θ ˜ and Λ, we have that explicit relation between Λ     ˜ ˜ ˜ ˜ ˜ ⊗ Λ k˜ R π π ⊗ Λ + R φ ⊗ Θ φ ⊗ Θ < ∞. Therefore for sufficiently small a > 0,  ¯ ≤ GN t, φ inf [R (ξ kξ 1 ⊗ Λ) + R (η kϕ ⊗ Θ)] + δ = GN (t, ϕ) + δ. (ξ,η)∈A(ϕ) We are now left to show the geometric ergodicity of the associated controlled ¯ j . In fact, if we can prove that p¯ and p˜ are mutually absolute continuous sequence X π ¯ j . Since ˜ -a.s., then repeating the proof as in Lemma 5.5 yields the ergodicity of X ¯ξ 1  π ˜ and ¯ξ 1 (A) ≥ a˜ π (A) for all Borel sets A of T1 , the Radon-Nikodym derivative f (x) = (d¯ξ 1 /d˜ π )(x) satisfies f (x) ∈ [a, ∞) π ˜ -a.s. for x ∈ T1 . For any Borel sets A and B, by (5.10) Z Z ˜ (dx) ≥ a˜ p¯ (x, B) f (x) π π ⊗ p˜ (A × B) = a p˜ (x, B) π ˜ (dx) . A A 75 ˜ -a.s. for all x ∈ T1 This implies that π a p¯ (x, B) ≥ p˜ (x, B) . (5.11) f (x) R On the other hand we know that R(¯ p(x, ·))¯ξ 1 (dx) < ∞, which yields p(x, ·)k˜ p¯ (x, ·)  p˜ (x, ·) , ¯ξ 1 -a.s. 
(5.12) Since ¯ξ 1 and π ˜ are mutually absolute continuous, there exists a Borel set C such that ¯ξ 1 (C) = π ˜ (C) = 0 and both (5.11) and (5.12) hold on the complement of C. Then without loss of generality, we can redefine p¯(x, ·) to equal p˜(x, ·) for x ∈ C. Therefore, p¯ and p˜ are mutually absolutely continuous π ˜ -a.s. Part 3. See Section 5.5. As noted previously, among the features that distinguish this problem from the standard theory for occupation measures is the fact that the underlying transition kernels depend on m. This is a substantial difference, since in the representation at the prelimit it is these kernels that appear in the relative entropy cost, and so a control that is designed on the limit transition kernel may not have even a finite cost for the prelimit kernel. In other words, one cannot in any straightforward way apply a control that is optimal or nearly optimal for the limit to the prelimit, and instead it must be carefully adapted so that the costs at the prelimit approximate those at the limit. Such an adaptation is the concern of the next lemma. In this lemma, we will relate perturbations on the transition kernel of a single particle to perturbations of the increments in the underlying random walk, and vice versa. In this context, a controlled increment will mean a distribution on increments that can depend in a Borel measurable way on the entire past of the random walk. Lemma 5.7. Let αm j (·|x) be the rescaled single particle transition kernel defined in 76 (3.6), and let x ∈ [0, 1). 1. Given a sequence of controlled increments {ν m j,i , i = 0, 1, . . .}, define the process m m Z¯j,i m by Z¯j,0 m = j + x and Z¯j,i+1 m = Z¯j,i m + ¯θj,i , where ¯θj,i has the conditional distribution ν m ¯m j,i . Let σ ¯ m / (−1, j + 1)} and τ m j = min{i ≥ 0 : Zj,i ∈ j be the distribution of Z¯j,¯ m σmj . Then   ¯m σ j −1 X  ¯  m m E R(ν m j,i kµ) ≥ R τ j αj . i=0 2. 
Conversely, given a controlled transition kernel τ (·|x) that satisfies R(τ kαm j ) < ∞ and dτ (·|x)/dαm j (·|x) ≥ c for some c > 0, one can construct controlled increments {ν m ¯m ¯ mσm has distribution τ , and j,i } and a process {Zj,i }, such that Zj,¯ j  m  ¯ j −1 σ m X R τ αj = E ¯ R(ν m  j,i kµ ) . i=0 Remark 5.8. The conclusion of the previous lemma is also valid for the limit transi- tion kernel γ (·|x, t). For instance, for each t ∈ [0, 1] and given a controlled transition kernel τ , there exists a controlled process Z¯i+1 (t) = Z¯i (t) + ¯θi (t) where the ¯θi (t) are ¯ (t) = min{i ≥ 0 : Z¯i (t) ≥ 1}, iid with distribution ν i (·|t), and a stopping time σ such that Z¯σ¯ (t) (t) has distribution τ and   σ ¯ (t)−1 X ¯ R (τ kγ (·|x, t)) = E R(ν i (· |t) kµ (· |t )) . i=0 The next result will also be used in the construction of a nearly optimal control. Proposition 5.9. The local rate function GN (t, φ) defined in (4.22), i.e. Z Z  GN (t, φ) = inf R (ξ(·|x) kΛ(· |x,t)) ξ 1 (dx) + R (η(·|n) kΘ(· |n)) η 1 (dn) , (ξ,η)∈A(φ) 77 is convex in the second variable. 5.4 Proof of the Upper Bound (2.3) We now proceed to the proof of the large deviation lower bound (2.3):  ˆm ≤ lim sup W inf IN (`) + F N , ` N m→∞ `∈P({1,...,N }) Z 1  N = inf R inf GN (t, φ (t)) dt + F , ` . `∈P({1,...,N }) φ:`= [0,1] φ(t)dt 0 Proof of (2.3). The infimum in (2.3) is over all measurable φ. Since such φ are not regular enough for the construction of a nearly optimal control, we first show how one can, without loss, assume greater regularity. To simplify notation, we omit the N dependence from IN and GN . Step 1: Existence of a near infimizer, `∗ and φ∗ (t), to the right hand side of (2.3). Given ε > 0, choose `∗ ∈ P ({1, . . . , N }) such that  I (`∗ ) + F N , `∗ ≤ inf I (`) + F N , ` + ε < ∞. `∈P({1,...,N }) R1 For such `∗ and ε there is φ∗ (t), t ∈ [0, 1] with `∗ = 0 φ∗ (s) ds satisfying Z 1 G (t, φ∗ (t)) dt ≤ I (`∗ ) + ε. 
0 Step 2: Discretization of φ∗ (t) in time, φκ (t), without increasing the cumulative rate. Rt Now let L (t) = 0 φ∗ (s) ds, so that L (1) = `∗ . For κ ∈ N and t ∈ [0, 1] define a piecewise linear approximation to L (t) via Z (i+1)/κ φκ (t) = κ φ∗ (s) ds if t ∈ [i/κ, (i + 1) /κ) , i ∈ {0, 1, . . . , κ − 1} i/κ 78 and Z t Lκ (t) = φκ (s) ds. 0 R1 R1 Note that Lκ (1) = 0 φκ (s) ds = 0 φ∗ (s) ds = `∗ . The discrete form of φκ (t) will be used, and we denote φκ (i/κ) = φκi . The convexity of G in the second variable proved in Proposition 5.9 yields Z Z (i+1)/κ ! 1   1 X κ−1 G t, L˙ κ (t) dt = G t, κ φ∗ (s) ds dt 0 κ i=0 i/κ X κ−1 Z (i+1)/κ ≤ G (t, φ∗ (t)) dt i=0 i/κ Z 1 = G (t, φ∗ (t)) dt. 0 For the given ε > 0 and κ ∈ N, it follows that Z   1  G t, L˙ κ (t) dt + F N , Lκ (1) ≤ inf I (`) + F N , ` + 2ε. (5.13) 0 `∈P({1,...,N }) R1 Step 3: Discrete sum approximation to the cumulative rate 0 G(t, L˙ κ (t))dt. Con- κ struction of a sequence of near infimum ergodic admissible controls (¯ξ i , η¯κi+1 ). For convenience in the presentation, we assume that the infimum in the definition of G is achieved. If not true, then we could alternatively work with a quantity that is within ε of the infimum as in step 1. Denote the infimizer in (4.22) at (i/κ, φκi ) by (ξ κi , η κi+1 ). Under Condition 1.4 the measures Λ(· |x, t) for different values of t are mutually absolutely continuous with uniformly bounded and continuous Radon- Nikodym derivatives. Thus for all t ∈ [i/κ, (i + 1)/κ) ˙ κ G(t, Lκ (t)) − G (i/κ, φi ) Z ≤ [R (ξ i (· |x) kΛ(· |x, t)) − R (ξ i (· |x) kΛ(· |x, i/κ))] ξ i,1 (dx) κ κ κ < εκ , 79 R1 where εκ → 0 as κ → ∞, and so we can approximate 0 G(t, L˙ κ (t))dt by the sum 1 Pκ−1 κ κ i=0 G (i/κ, φi ). Choose κ large enough that εκ < ε. 
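The mechanism behind Steps 2 and 3 can be checked on a toy example: replacing a density by its interval averages preserves the total mass exactly, and for a convex local rate the averaged cost never exceeds the original one (Jensen's inequality, as in Proposition 5.9). The functions phi and G below are invented stand-ins, not the thesis's objects.

```python
# Illustrative check (invented phi and G, not from the thesis): interval
# averaging preserves mass, and convexity of G means averaging cannot
# increase the cumulative cost -- the content of Step 2.

def interval_averages(phi, kappa, grid=2000):
    """Average phi over each interval [i/kappa, (i+1)/kappa) by a midpoint rule."""
    avgs = []
    for i in range(kappa):
        xs = [(i + (t + 0.5) / grid) / kappa for t in range(grid)]
        avgs.append(sum(phi(x) for x in xs) / grid)
    return avgs

phi = lambda t: t * t          # stand-in for phi*(t)
G = lambda v: v * v            # stand-in for a convex local rate G(t, .)

kappa = 50
avgs = interval_averages(phi, kappa)

total_mass = sum(avgs) / kappa                 # corresponds to L^kappa(1)
exact_mass = 1.0 / 3.0                         # integral of t^2 on [0, 1]
discr_cost = sum(G(a) for a in avgs) / kappa   # cost of the averaged control
exact_cost = 1.0 / 5.0                         # integral of G(phi*(t)) = t^4

assert abs(total_mass - exact_mass) < 1e-4     # mass preserved: L^kappa(1) = l*
assert discr_cost <= exact_cost + 1e-9         # Jensen: averaging cannot raise cost
```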
Furthermore, given any δ > 0, Lemma 5.6 implies that for each (ξ κi , η κi+1 ) ∈ A (φκi ) κ   κ there exist ¯ξ , η¯κ ¯ κ with ¯ ∈A φ φ − φκ ≤ δ, and such that the associated i i+1 i i i v Markov process is geometrically ergodic with bounded local transportation cost and Z Z κ  κ  κ R ¯ξ i (· |x) kΛ (· |x, i/κ) ¯ξ i,1 (dx) + ¯ (dn) R η¯κi+1 (· |n ) kΘ (· |n) φi ≤ G (i/κ, φκi ) + δ. . Pκ−1 ¯ κ ∗ If `¯ = 1 φ , then ` − ¯ ≤ δ. Note that by construction, `(0) ` ¯ = 0. Thus, κ i=0 i v we can choose δ ∈ (0, ε) small enough and use the boundedness of F N to obtain N N ∗ F , `¯ ≤ F , ` + ε. Combining the above inequalities with (5.13), we have 1X κ−1   ¯ κ + F N , `¯ ≤ G i/κ, φ inf I (`) + F N , ` + 5ε. (5.14) i κ i=0 `∈P({1,...,N }) κ Step 4: Reconstruction of controlled random walks from (¯ξ i , η¯κi+1 ). κ For each i ∈ {0, 1, . . . , κ − 1}, ¯ξ i (· |x) is a probability measure on the location of N particles after transportation. Writing this as a product of conditional distri- butions, we have the decomposition Y N ¯ξ κ (dy1 , . . . , dyN |x) = τ κi,n (dyn |x, y1 , . . . , yn−1 ) . i n=1 For notational convenience we write τ κi,n (dyn |x, y1 , . . . , yn−1 ) = τ κi,n (dyn |x), since the dependence on the previously generated particles is not particularly important. 80 By construction of the ergodic control in (5.9), ¯ξ κ = (1 − a) ξ κ + a˜ ˜ π ⊗ Λ, i i where a is chosen according to the parameter δ described in Step 3. Following a similar discussion to that in Part 2 of Lemma 5.6, we obtain the inequality ¯ξ κ (B |x) ≥ a ˜ i Λ (B |x, i/κ) , fiκ (x) κ where fiκ (x) = (d¯ξ i,1 /d˜ π )(x), B is any Borel subset of T2 , and π ˜ is the invariant ˜ (· |·, i/κ) and η κi+1 . Let γ κi (·|·) = γ(·|·, i/κ). By the distribution associated with Λ ˜ we then have dτ κi,n /dγ κi > 0 for all i, n. Hence, Remark 5.8 applies. 
definition of Λ, That is, for each single particle transition kernel τ κi,n , one can construct a sequence of conditional distributions {ν κi,n,k } ⊂ P(R) associated with random walks Z¯i,n,k+1 κ = κ Z¯i,n,k κ + ¯θi,n,k , Z¯i,n,0 κ = Z¯i−1,n,¯ κ σκi−1,n ¯ κi,n = min{k ≥ 0 : Z¯i,n,k and stopping times σ κ ≥ 1} satisfying Z¯i,n,¯ κ σκi,n ∼ τ κi,n (·), such that the relative entropy cost is preserved, i.e.,   ¯κ σ i,n −1  X  ¯ R τ κi,n (· |x) kγ (· |xn , i/κ) = E R ν κi,n,k (·) kµ (· |i/κ )  . (5.15) k=0 The random walks are constructed sequentially with the increment distributions   ¯ ¯θκ ∈ · x, Z¯ κ , . . . , Z¯ κ P = ν κi,n,k · x, Z¯i,n,0 κ , . . . , Z¯i,n,k κ . i,n,k i,n,0 i,n,k Again for simplicity we write ν κi,n,k (·|x, Z¯i,n,0 κ , . . . , Z¯i,n,k κ ) = ν κi,n,k (·|x). Step 5: Introduction of large deviation parameter and construction of the prelimit m controls (ˆξ j , ηˆm j ) for the final proof of (2.3). Now we introduce a large deviation parameter m ∈ N. We first elaborate on m the construction of transportation controls which are later denoted as ˆξ j . For n ∈ {1, . . . , N }, i ∈ {0, 1, . . . , κ − 1} and j/m ∈ [i/κ, (i + 1) /κ) define the controlled 81 random walk m Zˆj,n,k+1 m = Zˆj,n,k m + ˆθj,n,k , Zˆj,n,0 m = Zˆj−1,n,ˆ m σmj−1,n with increment distributions  n m o  ν κ (· |x) k≤σ ˆm ˆ ˆθ ˆm ˆ m i,n,k j,n (b) P j,n,k ∈ · x, Zj,n,0 , . . . , Zj,n,k = ,  µ(·|Zˆ m /m + j/m) otherwise j,n,k ˆm ˆm ˆm where σ j,n (b) = min{k ≥ 0 : Zj,n,k ≤ j(b − 1) or Zj,n,k ≥ 1} with b ∈ (0, 1). This means that we give up trying to control the process if it is sufficiently negative, κ and so the control used differs slightly from the control based completely on ¯ξ i , but m as we will see momentarily this is needed to control the cost. We let ˆξ j denote the corresponding transportation kernel over N particles. Also, the corresponding resampling controls are defined as ηˆm κ j (·|n) = η i+1 (·|n) for j/m ∈ [i/κ, (i + 1) /κ). 
As noted in (3.12), given $b < 1$ there exists $c_b < 1$ such that for $b^2 t \le s \le t/b^2$
\[
c_b\, \mu(dy \,|\, t) \le \mu(dy \,|\, s) \le \frac{1}{c_b}\, \mu(dy \,|\, t).
\]
Thus, for $m$ big enough that $(jb/m, (j+1)/m) \subset (ib^2/\kappa, i/(b^2\kappa))$ and $k \le \hat\sigma^m_{j,n}(b)$, we have the bound
\[
R\Bigl(\nu^\kappa_{i,n,k}(\cdot \,|\, x)\, \Big\|\, \mu\bigl(\cdot \,\big|\, \hat Z^m_{j,n,k}/m + j/m\bigr)\Bigr) \le R\bigl(\nu^\kappa_{i,n,k}(\cdot \,|\, x)\, \big\|\, \mu(\cdot \,|\, i/\kappa)\bigr) - \log c_b. \tag{5.16}
\]
Hence one can bound the prelimit transportation cost according to
\[
\bar E R\bigl(\hat\xi^m_j(\cdot \,|\, x)\, \big\|\, \Lambda^m_j(\cdot \,|\, x)\bigr) \le \bar E \sum_{n=1}^{N} \sum_{k=0}^{\hat\sigma^m_{j,n}(b)-1} R\Bigl(\nu^\kappa_{i,n,k}(\cdot \,|\, x)\, \Big\|\, \mu\bigl(\cdot \,\big|\, \hat Z^m_{j,n,k}/m + j/m\bigr)\Bigr)
\]
\[
\le \bar E \sum_{n=1}^{N} \sum_{k=0}^{\hat\sigma^m_{j,n}(b)-1} R\bigl(\nu^\kappa_{i,n,k}(\cdot \,|\, x)\, \big\|\, \mu(\cdot \,|\, i/\kappa)\bigr) - N \bar E \hat\sigma^m_{j,n}(b) \log c_b
\]
\[
\le \bar E \sum_{n=1}^{N} \sum_{k=0}^{\hat\sigma^m_{j,n}(0)-1} R\bigl(\nu^\kappa_{i,n,k}(\cdot \,|\, x)\, \big\|\, \mu(\cdot \,|\, i/\kappa)\bigr) + \varepsilon^m_b = \bar E R\bigl(\bar\xi^\kappa_i(\cdot \,|\, x)\, \big\|\, \Lambda(\cdot \,|\, x, i/\kappa)\bigr) + \varepsilon^m_b, \tag{5.17}
\]
where $\varepsilon^m_b \to 0$ as $m \to \infty$. The first inequality is an application of the chain rule (Theorem 0.8) and Lemma 5.7(a). Note that the controlled increments after time $\hat\sigma^m_{j,n}(b)$ do not contribute to the relative entropy cost, since, by construction, the control then follows the law of the underlying random walk. The second inequality is obtained from (5.16). The third inequality is true because $\sup \bar E \hat\sigma^m_{j,n}(b) < \infty$ is guaranteed by the finiteness of the relative entropy. The last equality is due to (5.15) and the fact that $\hat\sigma^m_{j,n}(0)$ corresponds to $\bar\sigma^\kappa_{i,n}$.

It is useful to review where we are in the construction at this stage. Starting with a control that is nearly optimal for the limit, we have constructed a control that can be applied to the increments of individual particles at the prelimit level, and which leads to nearly the same cost. However, to do so we have had to make a small change in the distribution of the processes at the threshold crossings. Specifically, in order to control the costs it was necessary to suspend active control of particles once they had moved far enough to the left, switch back to the underlying distribution, and rely on the law of large numbers for the underlying distribution to finish the construction.
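The cutoff construction just described can be sketched numerically. Everything below is an invented stand-in: Gaussian increments replace the abstract law $\mu$, and an exponentially tilted mean replaces the control $\nu$; only the mechanism (accrue relative-entropy cost while controlling, give up below a floor) mirrors the text.

```python
# Hypothetical illustration of the Step 5 cutoff: control a random walk with
# a tilted increment law until it crosses 1 (success) or falls below a floor,
# after which it reverts to the base law and accrues no further entropy cost.
import math, random

random.seed(0)

MU, SIGMA = -0.2, 1.0   # base increment law N(MU, SIGMA^2), stand-in for mu
THETA = 0.6             # exponential tilt; controlled mean is MU + THETA*SIGMA^2

def kl_tilted_vs_base():
    # R(N(MU + THETA*SIGMA^2, SIGMA^2) || N(MU, SIGMA^2)) = THETA^2 * SIGMA^2 / 2
    return THETA ** 2 * SIGMA ** 2 / 2.0

def run_walk(floor=-5.0, max_steps=10_000):
    z, cost, controlling = 0.0, 0.0, True
    for _ in range(max_steps):
        if z >= 1.0:
            return z, cost, True      # threshold crossed
        if z <= floor:
            controlling = False       # give up: revert to the base law
        mean = MU + (THETA * SIGMA ** 2 if controlling else 0.0)
        z += random.gauss(mean, SIGMA)
        if controlling:
            cost += kl_tilted_vs_base()  # cost accrues only while controlled
    return z, cost, False

hits = sum(run_walk()[2] for _ in range(200))  # most controlled walks succeed
```

With the tilted drift $0.4 > 0$ nearly every walk crosses the threshold, while the uncontrolled base drift $-0.2$ would make that a rare event; the accrued cost is the walk's relative-entropy budget.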
This led to a small change in the distribution of the transition kernel whose effect will vanish as $m \to \infty$. However, it means that we cannot directly apply the ergodic theorem as would traditionally be done at this point in a large deviations analysis, since we no longer have a stationary ergodic process, but only one whose nonstationary transition kernels are in some sense uniformly close to those of a geometrically ergodic Markov process. To prove the convergence of the controlled distributions on the number of successful particles, we will need the following result, whose proof is presented at the end of this chapter.

Proposition 5.10. Consider $j \in [i/\kappa, (i+1)/\kappa)$. Let $\hat p^m_j(x, dy)$ and $\bar p^\kappa_i(x, dy)$ be the transition kernels associated with $(\hat\xi^m_j, \bar\eta^\kappa_{i+1})$ and $(\bar\xi^\kappa_i, \bar\eta^\kappa_{i+1})$ respectively. Denote the empirical measure of $\{\hat X^m_j, i/\kappa \le j/m \le (i+1)/\kappa\}$ over this interval by $\hat J^m_i$ and the invariant distribution of $\bar p^\kappa_i$ by $\bar\lambda^\kappa_i$. Then
\[
\bigl\|\hat J^m_i - \bar\lambda^\kappa_i\bigr\|_v \to 0
\]
in probability as $m \to \infty$.

Step 6: Completion of the proof of (2.3).

With Proposition 5.10 in hand, we are now ready to complete the proof of (2.3). The starting point is the representation (3.8). It is easy to check that Steps 1-5 provide an admissible control sequence, and hence give an upper bound on $\hat W^m_N$. Using such controls we define the controlled sequences $(\hat X^m_j, \hat Y^m_j)$. For $i \in \{0, \ldots, \kappa - 1\}$, let
\[
m_i = \sum_{j=0}^{m-1} 1_{[i/\kappa, (i+1)/\kappa)}(j/m)
\]
be the number of stages within the continuous time interval $[i/\kappa, (i+1)/\kappa)$, and
\[
i_j = \sum_{k=0}^{i-1} m_k + j
\]
be the index of the $j$th element within the interval $[i/\kappa, (i+1)/\kappa)$.
With these 84 definitions, we can write ( ) 1 X h ˆm  ˆ m        i m−1 ˆ m ˆm m ˆm ˆm E R ξ j · Xj Λj · Xj + R ηˆj+1 · Nj+1 Θ · Nj+1 m j=0 ( X κ−1 mi ˆ 1 X mi −1 h      m ˆm m ˆm = E R ˆξ ij · X ij Λ ij · X ij i=0 m mi j=0 )     i κ ˆm ˆm + R η¯ · N i+1 Θ · N ij+1 ij+1 ( X κ−1 mi 1 X mi −1 h      ˆ ¯κ ˆm ˆm ≤ E R ξ i · Xij Λ · X ij , i/κ i=0 m mi j=0 )     i ˆm ˆm + R η¯κi+1 · Nij+1 Θ · N ij+1 + εm . b The equality is just due to definitions, while the inequality follows from (5.17). Since κ is fixed, mi /m → 1/κ as m → ∞, and it will suffice to obtain upper bounds on each term of the form ( 1 X mi −1 h      ˆ ¯ κ ˆm ˆm lim sup E R ξ i · Xij Λ · Xij , i/κ mi →∞ mi j=0 )     i ˆm ˆm + R η¯κi+1 · N ij+1 Θ · N ij+1 . (5.18) ˆm For each i, the chain X ˆm ij is obtained from the transition kernel ξ ij , which is uniformly κ close to ¯ξ i . It follows from part 3 of Lemma 5.6 and Proposition 5.10 that the limit R κ κ superior of the first term is bounded above by R(¯ξ i (·|x)kΛ(·|x, i/κ))¯ξ i,1 (dx) + ε. A similar but simpler argument can be applied to the second term to obtain the combined upper bound Z Z κ  κ  κ R ¯ξ i (· |x) kΛ (· |x, i/κ) ¯ξ i,1 (dx) + ¯ (dn) + 2ε R η¯κi+1 (· |n) kΘ (· |n) φ i κ ¯κ. for (5.18), where ¯ξ i,1 = λi 85 To complete the argument, we will have to compute an upper bound for EhFˆ N, L ˆ m i, ˆ m = 1/m Pm δ ˆ m . Since the transportation measures ˆξ m and ¯ξ κ are close where L j=1 N j j i in total variation norm for j ∈ [i/κ, (i + 1)/κ) by construction, together with Propo- ˆm sition 5.10 and the definitions of L ¯ j and ` for sufficiently large m D E ˆ m ≤ F N , `¯ + ε. ˆ FN, L E We now combine the last four displays and (5.14) to obtain ˆm lim sup W N m→∞ Z Z 1 κ  κ  κ ≤ ¯ ¯ ¯ (dn) R ξ i (· |x) kΛ (· |x, i/κ) ξ i,1 (dx) + R η¯κi+1 (· |n ) kΘ (· |n) φi κ N + F , `¯ + 3ε + lim sup lim sup εm b b→1 m→∞ 1 X κ−1  ≤ ¯ κ + F N , `¯ + 4ε G i/κ, φ i κ i=0  ≤ inf I (`) + F N , ` + 9ε. 
`∈P({1,...,N }) Since ε is arbitrary, we have proved the upper bound (2.3). 5.5 Appendix We finish this chapter by giving the proofs of Lemma 5.7, Proposition 5.9, Proposition 5.10 and Lemma 5.6(3). Lemma 5.7. Let αm j (·|x) be the rescaled single particle transition kernel defined in (3.6), and let x ∈ [0, 1). 1. Given a sequence of controlled increments {ν m j,i , i = 0, 1, . . .}, define the process m m Z¯j,i m by Z¯j,0 m = j + x and Z¯j,i+1 m = Z¯j,i m + ¯θj,i , where ¯θj,i has the conditional distribution ν m ¯m ¯ m / (−1, j + 1)} and τ m be the j,i . Let σ j = min{i ≥ 0 : Zj,i ∈ j 86 distribution of Z¯j,¯ m σmj . Then   ¯m σ j −1 X  ¯  m m E R(ν m j,i kµ) ≥ R τ j αj . i=0 2. Conversely, given a controlled transition kernel τ (·|x) that satisfies R(τ kαm j ) < ∞ and dτ (·|x)/dαm j (·|x) ≥ c for some c > 0, one can construct controlled increments {ν m ¯m ¯ m m has distribution τ , and j,i } and a process {Zj,i }, such that Zj,¯ σj  m  ¯ j −1 σ m X R τ αj = E ¯ R(ν m  j,i kµ ) . i=0 Proof. Part 1. For fixed T < ∞, consider a modification of ν m ¯m j,i , such that if σ j < T, then we redefine ν m ¯m j,i = µ for σ j < i ≤ T − 1. With this modified definition of the control, which does not change the distribution of the hitting time on the set σm {¯ j < T }, we have     ¯m σ j −1 σm (¯ j ∧T )−1 X X ¯ R(ν m  ≥ E ¯ R(ν m  E j,i kµ) j,i kµ) (5.19) i=0 i=0   = R QT {0,...,T −1} P|{0,...,T −1} , where QT denotes the modified joint measure on the space of increments, QT |{0,...,T −1} denotes the restriction to the first T coordinates in the underlying product space, and P|{0,...,T −1} denotes product measure with marginal µ. We now consider the disjoint, finite partition of the space RT = ATj ∪ BjT ∪ DjT , 87 where n o ATj = ω : Z¯j,σ m m ≥ j + 1, σ j ¯ m j < T , n o BjT = ω : Z¯j,σm j ¯m m ≤ −1, σ j < T ,  DjT = ω:σ ¯mj ≥ T . 
Using the approximation property of relative entropy via sums over finite measurable partitions, Lemma 0.7(g), we obtain
\[
R\Bigl(Q^T\big|_{\{0,\ldots,T-1\}}\, \Big\|\, P\big|_{\{0,\ldots,T-1\}}\Bigr) \ge R\bigl(\tau^{m,T}_j\, \big\|\, \alpha^{m,T}_j\bigr), \tag{5.20}
\]
where $\tau^{m,T}_j, \alpha^{m,T}_j \in \mathcal{P}(\{(-\infty,-1] \cup [j+1,\infty), \Sigma\})$ are the measures induced by $Q^T|_{\{0,\ldots,T-1\}}$ and $P|_{\{0,\ldots,T-1\}}$, and $\Sigma$ corresponds to the event $\bar\sigma^m_j \ge T$.

We know that $\alpha^{m,T}_j(\Sigma) \to 0$ as $T \to \infty$. Let $\bar\alpha^m_j$ denote the extension of $\alpha^m_j$ to $\{(-\infty,-1] \cup [j+1,\infty), \Sigma\}$ with $\bar\alpha^m_j(\Sigma) = 0$. Let $\bar\tau^m_j$ denote the limit of $\tau^{m,T}_j$, which must exist by monotonicity. By the lower semicontinuity of relative entropy,
\[
\liminf_{T \to \infty} R\bigl(\tau^{m,T}_j\, \big\|\, \alpha^{m,T}_j\bigr) \ge R\bigl(\bar\tau^m_j\, \big\|\, \bar\alpha^m_j\bigr). \tag{5.21}
\]
If $\bar\tau^m_j(\Sigma) > 0$ then $R(\bar\tau^m_j \| \bar\alpha^m_j) = \infty$, and if $\bar\tau^m_j(\Sigma) = 0$ then $R(\bar\tau^m_j \| \bar\alpha^m_j) = R(\tau^m_j \| \alpha^m_j)$. The last sentence and (5.19), (5.20) and (5.21) imply
\[
\bar E \left[ \sum_{i=0}^{\bar\sigma^m_j - 1} R(\nu^m_{j,i} \| \mu) \right] \ge R\bigl(\tau^m_j\, \big\|\, \alpha^m_j\bigr).
\]

Part 2. Define
\[
U(x; y) = -\log \frac{d\tau(\cdot \,|\, x)}{d\alpha^m_j(\cdot \,|\, x)}(y).
\]
Since by assumption $U(x; y)$ is bounded from above, according to [9, Proposition 4.5.1]
\[
-\log \int e^{-U(x;y)}\, \alpha^m_j(dy \,|\, x) = \inf_{\beta : R(\beta \| \alpha^m_j) < \infty} \left[ R\bigl(\beta\, \big\|\, \alpha^m_j\bigr) + \int U(x; y)\, \beta(dy) \right].
\]
Since $R(\tau \| \alpha^m_j) < \infty$, by the definition of $U$,
\[
\tau(dy \,|\, x) = e^{-U(x;y)}\, \alpha^m_j(dy \,|\, x) \tag{5.22}
\]
achieves the infimum in the variational formula, i.e.,
\[
-\log \int e^{-U(x;y)}\, \alpha^m_j(dy \,|\, x) = R\bigl(\tau\, \big\|\, \alpha^m_j\bigr) + \int U(x; y)\, \tau(dy \,|\, x).
\]
We next consider the analogous relations on the space of increments of the process. We have
\[
-\log \int e^{-U\bigl(x;\, \bar Z^m_{j,\bar\sigma^m_j}(\omega)\bigr)}\, P(d\omega) = \inf_{Q} \left[ R(Q \| P) + \int U\bigl(x;\, \bar Z^m_{j,\bar\sigma^m_j}(\omega)\bigr)\, Q(d\omega) \right],
\]
where $\bar\sigma^m_j$ denotes the first hitting time of the process and $P$ is the underlying product probability with marginal $\mu$. The definition of $\alpha^m_j$ implies that for any integrable function $K$
\[
\int K\bigl(\bar Z^m_{j,\bar\sigma^m_j}(\omega)\bigr)\, e^{-U\bigl(x;\, \bar Z^m_{j,\bar\sigma^m_j}(\omega)\bigr)}\, P(d\omega) = \int K(y)\, e^{-U(x;y)}\, \alpha^m_j(dy \,|\, x).
\]
Also, the minimum is achieved at
\[
Q(d\omega) = e^{-U\bigl(x;\, \bar Z^m_{j,\bar\sigma^m_j}(\omega)\bigr)}\, P(d\omega).
\]
(5.23) 89 Since the first hitting probability αm j is induced by P, (5.22) and (5.23) imply Z   Z   −U  ¯ m m (ω)  x;Z U x; Z¯j,¯ m σmj (ω) Q (dω) = U x; Z¯j,¯ m σjm (ω) e j,¯ σ j P (dω) Z = U (x; y) e−U (x;y) αm j (dy|x) Z = U (x; y) τ (dy|x) . Hence, by the chain rule "∞ #  m  ¯ j −1 σ m X X R τ αj = R (Q kP) = EQ R(ν m j,i kµ) ≥ E Q R(ν m  j,i kµ ) , i=0 i=0 where {ν m j,i } are the controls defined by factoring Q. Combining this with the bound proved in part 1, we have constructed controls and a controlled process {Z¯j,i m } satisfying   ¯m σ j −1 X m EQ  R(ν m  j,i kµ) = R τ αj , i=0 where τ is the distribution induced by the stopped, controlled process. Although here we have constructed the control on the canonical space R∞ , it can be applied in the general setting by making the obvious identifications. Proposition 5.9. The local rate function GN (t, φ) defined in (4.22) is convex in the second variable. Proof. By (4.21), Z dξ (· |x) R (ξ kξ 1 ⊗ Λ) = ξ (dx, dy) η (dσ |gnum (y)) log (y), dΛ (· |x, t) Z dη (· |gnum (y)) R (η kη 1 ⊗ Θ) = ξ (dx, dy) η (dσ |gnum (y)) log (σ). dΘ (· |gnum (y)) 90 Summing these two equalities, we have GN (t, φ) Z dξ (· |x) dη (· |gnum (y)) = inf ξ (dx, dy) η (dσ |gnum (y)) log (y) (σ). (ξ,η)∈A(φ) dΛ (· |x, t) dΘ (· |gnum (y)) Let Ψ (dy, dσ |x, t) = Θ (dσ |gnum (y)) Λ (dy |x, t) , and let p be the transition kernel associated with (Λ, Θ). Denote the corresponding ¯ p¯. Then the local rate function G can be written as controlled versions by Ψ, GN (t, φ)  Z Z     = inf inf ¯ (· |x) kΨ (· |x, t) : φ (·) = ξ 1 (dx) R Ψ ¯ ξ 1 ⊗ Ψ 2 (dy) . ξ 1 =ξ 1 p¯ {gnum (y)=·}  ¯ i , i = 1, 2 be within ε > 0 of the infimum. For φ1 , φ2 ∈ P({1, . . . , N }), let ξ i1 , Ψ ¯ 1 +(1 − β) ξ 2 ⊗ Ψ For β ∈ (0, 1), the convex combination βξ 11 ⊗ Ψ ¯ 1 can be decomposed 1 ¯ where (ξ 1 , Ψ) into ξ 1 ⊗ Ψ, ¯ are defined on the same space as (ξ i , Ψ ¯ i ). 
Then ξ 1 = 1 βξ 11 + (1 − β) ξ 21 and Z   ¯ (dy) = βφ1 ({n}) + (1 − β) φ2 ({n}) , 1{gnum (y)=n} ξ 1 ⊗ Ψ 2 Furthermore, Z ¯ (dy, dσ |x) ξ 1 (dx) δ {h(g(y),σ)} (·) Ψ Z = β δ {h(g(y),σ)} (·) Ψ ¯ 1 (dy, dσ |x) ξ 11 (dx) Z + (1 − β) δ {h(g(y),σ)} (·) Ψ ¯ 2 (dy, dσ |x) ξ 2 (dx) 1 = βξ 11 (dx) + (1 − β) ξ 21 (dx) = ξ 1 (dx) . 91 The convexity of relative entropy then implies βGN (t, φ1 ) + (1 − β) GN (t, φ2 ) + 2ε   ≥ βR ξ 11 ⊗ Ψ ¯ 1 ξ 1 ⊗ Ψ + (1 − β) R ξ 2 ⊗ Ψ ¯ 1 ξ 2 ⊗ Ψ 1 1 1   ≥ R βξ 11 ⊗ Ψ ¯ 1 βξ 11 + (1 − β) ξ 21 ⊗ Ψ ¯ 1 + (1 − β) ξ 21 ⊗ Ψ  = R ξ1 ⊗ Ψ ¯ kξ 1 ⊗ Ψ ≥ GN (t, βφ1 + (1 − β) φ2 ) . Since ε > 0 is arbitrary, convexity follows. Proposition 5.10. Consider j ∈ [i/κ, (i + 1)/κ). Let pˆm ¯κi (x, dy) be j (x, dy) and p m κ the transition kernels associated with (ˆξ j , η¯κi+1 ) and (¯ξ i , η¯κi+1 ) respectively. Denote ˆm the empirical measure of {X ˆm j , i/κ ≤ j/m ≤ (i + 1)/κ} over this interval by Ji and ¯ κ . Then the invariant distribution of p¯κi by λ i ˆm ¯ κ Ji − λi → 0 v in probability as m → ∞. ˆm Proof. Since {X ˆm / j } is tight, given η > 0, there is compact S1 ⊂ T1 such that P{Xj ∈ S1 } ≤ η for all j ≤ m, including the index that first exceeds i/κ. Using standard results from stochastic stability, there is a compact set S2 ⊂ T1 such that, after starting in S1 , the probability of escaping from S2 is also less than η. To simplify the presentation, we denote the initial value of j ∈ [i/κ, (i + 1)/κ) by zero and the final value by m, and write Jˆm for Jˆim . By the geometric ergodicity of p¯κi and the compactness of S2 κ,(k) ¯ κ p¯i (x, ·) − λi (·) → 0 v 92 κ,(k) uniformly in x ∈ S2 , where p¯i is the k-fold convolution of p¯κi . Let δ > 0, and assume that k is large enough that for such x κ,(k) ¯ κ i p¯ (x, ·) − λ i (·) < δ. (5.24) v Let pˆm ˆm j,` denote convolution of the transition kernels p ˆm jk+` , p ˆm jk+`+1 , . . . p (j+1)k+`−1 , ˆm ˆ m = x then the distribution of X ˆm so that if X kj+` k(j+1)+` is given by p j,` (x, ·). Let ζ > 0 be given. 
For $m$ sufficiently large we know that
\[
\bigl\|\hat p^m_j(x, \cdot) - \bar p^\kappa_i(x, \cdot)\bigr\|_v < \zeta. \tag{5.25}
\]
Iterating (5.25) gives
\[
\bigl\|\hat p^m_{j,\ell}(x, \cdot) - \bar p^{\kappa,(k)}_i(x, \cdot)\bigr\|_v \le k\zeta
\]
for all $\ell$ and all $x \in S_2$, and thus
\[
\bigl\|\hat p^m_{j,\ell}(x, \cdot) - \bar\lambda^\kappa_i(\cdot)\bigr\|_v < k\zeta + \delta \tag{5.26}
\]
for all $\ell$ and all $x \in S_2$.

Let $k \in \mathbb{N}$ be given. We will break the empirical measure $\hat J^m$ into $k$ sums, and it follows from the fact that $k$ will be fixed as $m \to \infty$ that we can consider just those $m$ of the form $rk$, $r \in \mathbb{N}$. We then write
\[
\hat J^m = \frac{1}{m} \sum_{j=0}^{m-1} \delta_{\hat X^m_j} = \frac{1}{k} \sum_{\ell=0}^{k-1} \frac{k}{m} \sum_{j=0}^{[m/k]-1} \delta_{\hat X^m_{kj+\ell}} = \frac{1}{k} \sum_{\ell=0}^{k-1} \hat J^m_\ell,
\]
where the last equality defines the $\hat J^m_\ell$.

Now fix $\ell \in \{0, 1, \ldots, k-1\}$ and consider any bounded measurable function $f : T_1 \to \mathbb{R}$ with $\|f\|_\infty \le 1$. As usual, we have that
\[
f\bigl(\hat X^m_{k(j+1)+\ell}\bigr) - \int f(y)\, \hat p^m_{j,\ell}\bigl(\hat X^m_{jk+\ell}, dy\bigr)
\]
is a martingale difference sequence, and thus by (5.26)
\[
\left[ f\bigl(\hat X^m_{k(j+1)+\ell}\bigr) - \int f(y)\, \bar\lambda^\kappa_i(dy) \right] 1_{\{\hat X^m_{jk+\ell} \in S_2\}} = \varepsilon^m_{1,j} + \varepsilon^m_{2,j},
\]
where $|\varepsilon^m_{1,j}| \le k\zeta + \delta$ and $\varepsilon^m_{2,j}$ has conditional mean zero and uniformly (in $m$ and $j$) bounded second moment. It follows using a standard calculation that
\[
P\left( \bigcup_{\ell=0}^{k-1} \left\{ \left| \int f(x)\, \hat J^m_\ell(dx) - \int f(y)\, \bar\lambda^\kappa_i(dy) \right| \ge 2[k\zeta + \delta] \right\} \right) \le \frac{k^2}{4[k\zeta + \delta]^2 m} + 2\eta.
\]
Letting first $m \to \infty$ shows that weak limits of the $\hat J^m_\ell$ are all within $2[k\zeta + \delta]$ of $\bar\lambda^\kappa_i$ in total variation norm, save on a set of probability no more than $2\eta$. We now send $\zeta \downarrow 0$, then $\delta \downarrow 0$, and finally $\eta \downarrow 0$ to complete the proof.

We are now ready to prove Part 3 of Lemma 5.6.

Proof. Let $h(x, y) = (d\bar\Lambda/d\Lambda)(x, y)$, where the Radon-Nikodym derivative is in $y$. For $k \in \mathbb{N}$ define
\[
S^k(x) = 1 - \int_{T_2} h_k(x, y)\, \Lambda(dy \,|\, x),
\]
where $h_k(x, y) = k \wedge h(x, y)$. Also, let
\[
\bar\Lambda^k(A \,|\, x) = \int_A h_k(x, y)\, \Lambda(dy \,|\, x) + S^k(x)\, \tilde\Lambda(A \,|\, x)
\]
for any Borel set $A$ of $T_2$. Obviously, $(\bar\xi^k, \bar\eta)$ is also admissible and the associated transition kernel is equivalent to $\bar p(x, \cdot)$. Hence this transition kernel is also geometrically ergodic.
Furthermore, the relative entropy
\[
R\big( \bar\Lambda^k(\cdot|x) \,\big\|\, \Lambda(\cdot|x) \big)
= \int_{T_2} \log h^k(x,y)\, \bar\Lambda^k(dy|x) + S^k(x)\, R\big( \tilde\Lambda(\cdot|x) \,\big\|\, \Lambda(\cdot|x) \big)
\]
is uniformly bounded as a function of x. Since \(S^k(x) \to 0\) as \(k \to \infty\), the Dominated Convergence Theorem implies (5.8).

It is easy to check that the construction implies \(\|\bar\Lambda^k(\cdot|x) - \bar\Lambda(\cdot|x)\|_v \to 0\). It then follows from Proposition 5.10 that the first marginals of \(\bar\xi^k\) converge to \(\bar\xi_1\) in total variation, i.e. \(\|\bar\xi_1^k - \bar\xi_1\|_v \to 0\).

Chapter 6
Asymptotics of the Decay Rate as N → ∞

In this chapter, we assume for simplicity that the increments \(\{\theta_i, i \in \mathbb{N}\}\) are iid with \(\theta_i \sim \mu(dy)\), and evaluate the asymptotics of \(I^N\) as \(N \to \infty\). The analogous result holds for the general model used previously, but with more involved constructions and notation. Under this assumption, it follows that the rate function of interest coincides with the local rate function. That is, for \(\mu \in \mathcal{P}(\{1,\ldots,N\})\) the rate function can be written as
\[
I^N(\mu) = \inf_{(\bar\xi, \bar\eta) \in A(\mu)} \left[ R\big(\bar\xi \,\big\|\, \bar\xi_1 \otimes \Lambda\big) + R\big(\bar\eta \,\big\|\, \bar\eta_1 \otimes \Theta\big) \right],
\]
where \(A(\mu) = \{\text{feasible } (\bar\xi, \bar\eta) \text{ with } \bar\eta_1 = \mu\}\). As established in Theorem 2.2, the decay rate for the second moment has the expression
\[
V^N = \inf \left[ R\big(\bar\xi \,\big\|\, \bar\xi_1 \otimes \Lambda\big) + R\big(\bar\eta \,\big\|\, \bar\eta_1 \otimes \Theta\big) - 2\log \int \frac{n}{N}\, \bar\eta_1(dn) \right].
\]
Throughout this chapter we assume that all the conditions of Theorem 2.2 are in force, and adapt the notation to reflect the fact that the local rate function no longer depends on t.

Recalling that \(\gamma(dy|x)\) is the transition kernel for a single particle, it follows from Condition 1.4 and Condition 5.1 that there exists a unique pair of measures \((\pi, \kappa) \in \mathcal{P}([0,\infty)) \times \mathcal{P}(\{\Delta\} \cup [0,\infty))\) satisfying
\[
\kappa = [\pi \otimes \gamma]_2, \quad \text{and} \quad \pi(\cdot) = \frac{\kappa(\cdot)}{1 - \kappa(\Delta)}. \tag{6.1}
\]
In other words, \(\kappa(\Delta)\) is the stationary mass absorbed by \(\Delta\) under \(\gamma\), and if we renormalize the mass that remains then \(\pi\) is the invariant probability for the resulting transition kernel.
It is easy to check that the large deviation rate defined in (1.5) is exactly \(\gamma = -\log(1 - \kappa(\Delta))\), and so the best possible rate of convergence for the Monte Carlo scheme is \(-2\log(1 - \kappa(\Delta))\). By Jensen's inequality \(V^N \le -2\log(1 - \kappa(\Delta))\): since the estimator is unbiased, its second moment is at least \(p_m^2\), so the second moment cannot decay faster than \(2\gamma\). It is not hard to show that the inequality is in general strict. The following result shows that this rate is achieved when the number of particles tends to infinity.

Theorem 6.1. \(V^N \to -2\log(1 - \kappa(\Delta))\) as \(N \to \infty\).

We will study the asymptotics of \(V^N\) using many of the same weak convergence arguments used previously. We first recall the following factorization of the underlying transportation and resampling distributions:
\[
\begin{aligned}
\Lambda(dy|x) &= \otimes_{i=1}^N \gamma(dy_i|x_i)
= \otimes_{i=1}^N \left[ 1_{[0,1)}(x_i)\, \alpha(dy_i|x_i) + 1_{[1,\infty)}(x_i)\, \delta_{x_i}(dy_i) \right], \\
\Theta(d\sigma|n) &= \otimes_{i=1}^n \delta_i(d\sigma_i)\; \otimes_{i=n+1}^N U_n(d\sigma_i).
\end{aligned}
\]
As in Lemma 4.5 we can factor the controls according to
\[
\begin{aligned}
\bar\xi(dx,dy) &= \otimes_{i=1}^N \bar\xi^i(dx_i,dy_i \,|\, x_1,y_1,\ldots,x_{i-1},y_{i-1}) = \otimes_{i=1}^N \bar\xi^i(dx_i,dy_i), \\
\bar\xi_1(dx) &= \otimes_{i=1}^N \bar\xi_1^i(dx_i \,|\, x_1,\ldots,x_{i-1}) = \otimes_{i=1}^N \bar\xi_1^i(dx_i), \\
\bar\eta(d\sigma|n) &= \otimes_{i=n+1}^N \bar\eta^i(d\sigma_i \,|\, n, \sigma_1,\ldots,\sigma_{i-1})\; \otimes_{i=1}^n \delta_i(d\sigma_i)
= \otimes_{i=n+1}^N \bar\eta^i(d\sigma_i|n)\; \otimes_{i=1}^n \delta_i(d\sigma_i),
\end{aligned}
\]
where the right hand side indicates the variables for which dependence is suppressed in the notation. With an abuse of notation we sometimes write \(\bar\eta^i(d\sigma_i|n)\) for \(\delta_i(d\sigma_i)\) even when \(i \le n\). Note that the controls all depend on N, but we also omit this dependence in the notation. The chain rule for relative entropy and Jensen's inequality then yield
\[
R\big(\bar\xi \,\big\|\, \bar\xi_1 \otimes \Lambda\big)
= \bar E\left[ \sum_{i=1}^N R\big( \bar\xi^i(dx,dy) \,\big\|\, \bar\xi_1^i(dx) \otimes \gamma(dy|x) \big) \right]
\ge N\, \bar E\left[ R\left( \frac{1}{N}\sum_{i=1}^N \bar\xi^i(dx,dy) \,\middle\|\, \frac{1}{N}\sum_{i=1}^N \bar\xi_1^i(dx) \otimes \gamma(dy|x) \right) \right],
\]
where \(\bar E\) indicates integration with respect to the suppressed variables. Since the total cost is bounded for all N, we conclude that
\[
R\left( \frac{1}{N}\sum_{i=1}^N \bar\xi^i(dx,dy) \,\middle\|\, \frac{1}{N}\sum_{i=1}^N \bar\xi_1^i(dx) \otimes \gamma(dy|x) \right) \to 0
\]
in probability.
Tightness follows as in Chapter 4, and hence for any subsequence there exists a subsubsequence such that
\[
\frac{1}{N}\sum_{i=1}^N \bar\xi_1^i \to \bar\pi
\]
in probability. Thus
\[
\frac{1}{N}\sum_{i=1}^N \bar\xi^i \to \bar\pi \otimes \gamma. \tag{6.2}
\]

6.1 Proof of Theorem 6.1

For any feasible controls \((\bar\xi, \bar\eta)\), let \(\{(\bar X_i^N, \bar Y_i^N),\ i=1,2,\ldots,N\}\) be processes with the given joint distributions. Define \(\bar F_i^N\) to be the \(\sigma\)-field generated by \(\{\bar X_k^N,\ k=1,\ldots,i-1\}\) and \(\bar G_i^N\) the one generated by \(\{(\bar X_k^N, \bar Y_k^N),\ k=1,\ldots,i-1\}\). Then for any bounded continuous function f,
\[
f(\bar X_i) - \int f(x)\, \bar\xi_1^i(dx), \quad i=1,\ldots,N,
\qquad\text{and}\qquad
f(\bar Y_i) - \int f(y)\, \bar\xi^i(dx,dy), \quad i=1,\ldots,N,
\]
are martingale difference sequences, with respect to \(\bar F_i^N\) and \(\bar G_i^N\), respectively. This implies
\[
\left( \frac{1}{N}\sum_{i=1}^N \delta_{\bar X_i}, \ \frac{1}{N}\sum_{i=1}^N \bar\xi_1^i \right) \to (\bar\pi, \bar\pi),
\qquad
\left( \frac{1}{N}\sum_{i=1}^N \delta_{\bar Y_i}, \ \frac{1}{N}\sum_{i=1}^N [\bar\xi^i]_2 \right) \to (\bar\kappa, \bar\kappa)
\]
in distribution, and by (6.2) \(\bar\kappa\) satisfies
\[
\bar\kappa = [\bar\pi \otimes \gamma]_2. \tag{6.3}
\]
Interpreting \(g_{loc}(y)\) as vector valued with components \(g_{loc,i}(y)\) that identify the locations from which the resampling controls can select, we have
\[
\bar E\left[ f(\bar X_i^N) - \int f\big(g_{loc,\sigma}(\bar Y^N)\big)\, \bar\eta^i\big(d\sigma \,\big|\, g_{num}(\bar Y^N)\big) \,\middle|\, \bar Y^N, \bar X_1, \ldots, \bar X_{i-1} \right] = 0.
\]
This implies the bound
\[
\begin{aligned}
\bar E&\left[ \left( \frac{1}{N}\sum_{i=1}^N f(\bar X_i) - \frac{1}{N}\sum_{i=1}^N \int f\big(g_{loc,\sigma}(\bar Y^N)\big)\, \bar\eta^i\big(d\sigma \,\big|\, g_{num}(\bar Y^N)\big) \right)^2 \right] \\
&= \frac{1}{N^2} \sum_{i=1}^N \bar E\left[ \left( f(\bar X_i) - \int f\big(g_{loc,\sigma}(\bar Y^N)\big)\, \bar\eta^i\big(d\sigma \,\big|\, g_{num}(\bar Y^N)\big) \right)^2 \right]
\le 2\|f\|_\infty^2 / N \to 0.
\end{aligned} \tag{6.4}
\]
If we can show that
\[
\frac{1}{N}\sum_{i=1}^N \int \delta_{g_{loc,\sigma}(\bar Y^N)}(dx)\, \bar\eta^i\big(d\sigma \,\big|\, g_{num}(\bar Y^N)\big) \to \frac{\bar\kappa(dx)}{1 - \bar\kappa(\Delta)} \tag{6.5}
\]
in probability, then (6.4) implies
\[
\frac{1}{N}\sum_{i=1}^N \delta_{\bar X_i} \to \frac{\bar\kappa(dx)}{1 - \bar\kappa(\Delta)} = \bar\pi.
\]
By uniqueness of the invariant distribution under the renormalized transition kernel and (6.1), we conclude \((\bar\pi, \bar\kappa) = (\pi, \kappa)\). Then Theorem 6.1 will follow from the fact that, by Fatou's Lemma,
\[
\liminf_{N\to\infty} V^N
\ge \liminf_{N\to\infty} \left[ R\big(\bar\xi \,\big\|\, \bar\xi_1 \otimes \Lambda\big) + R\big(\bar\eta \,\big\|\, \bar\eta_1 \otimes \Theta\big) - 2\,\bar E \log\left( \frac{1}{N}\sum_{i=1}^N 1_{[0,\infty)}(\bar Y_i^N) \right) \right]
\ge -2\log(1 - \kappa(\Delta)).
\]
Note that in fact we will have shown that along the minimizing sequence
\[
R\big(\bar\xi \,\big\|\, \bar\xi_1 \otimes \Lambda\big) + R\big(\bar\eta \,\big\|\, \bar\eta_1 \otimes \Theta\big) \to 0.
\]
It remains to show that (6.5) is true. This is a consequence of the finiteness of the rate. In particular,
\[
R\big(\bar\eta \,\big\|\, \bar\eta_1 \otimes \Theta\big)
\ge N\, \bar E\left[ \int R\left( \frac{1}{N}\sum_{i=1}^N \bar\eta^i(d\sigma|n) \,\middle\|\, \frac{N-n}{N}\, U_n(d\sigma) + \frac{1}{N}\sum_{i=1}^n \delta_i(d\sigma) \right) \bar\eta_1(dn) \right]
\]
for all N. Therefore
\[
\frac{1}{N}\sum_{i=1}^N \int \delta_{g_{loc,\sigma}(\bar Y^N)}(\cdot)\, \bar\eta^i\big(d\sigma \,\big|\, g_{num}(\bar Y^N)\big)
\]
has the same limit in probability as
\[
\frac{1}{N}\left[ \sum_{i=1}^{g_{num}(\bar Y^N)} \int \delta_{g_{loc,\sigma}(\bar Y^N)}(\cdot)\, \delta_i(d\sigma)
+ \big(N - g_{num}(\bar Y^N)\big) \int \delta_{g_{loc,\sigma}(\bar Y^N)}(\cdot)\, U_{g_{num}(\bar Y^N)}(d\sigma) \right].
\]
This completes the proof of Theorem 6.1.

Part III
Numerical Results

Chapter 7
Computational Results

In this chapter, we present some computational results for the interacting particle scheme in a simple one-dimensional random walk setting.

Problem Setup and Notation

Let \(\{X_j, j = 0, 1, \ldots\}\) be a sequence of i.i.d. random variables taking values in the discrete state space \(\{-1, 1, 2\}\) with probabilities \(\{0.6, 0.3, 0.1\}\), respectively. For each fixed \(n \in \mathbb{N}\), let \(S_n\) denote the sum \(X_0 + X_1 + \cdots + X_{n-1}\). Let \(m \in \mathbb{N}\) be the large deviation parameter, i.e., the number of thresholds. Define the stopping time \(\sigma_m = \min\{n : S_n \ge m \text{ or } S_n \le -1\}\). Then the quantity of interest defined in (1.3) becomes
\[
p_m = P\{ S_{\sigma_m} \ge m \,|\, S_0 = 0 \}.
\]
The importance sampling estimates of \(p_m\) for different values of m are given in Table 7.1, each computed with sample size 100,000,000. For simplicity, we take these approximations as the true values of \(p_m\) and compare the results from the IPS scheme to them. Throughout this chapter, we use N for the number of particles of the IPS estimator and K for the sample size. \(\{\hat p_i, i = 0, \ldots, K-1\}\) represents the collection of K IPS samples, and
\[
\frac{1}{K}\sum_{i=0}^{K-1} \hat p_i
\]
is the IPS estimate of \(p_m\).

  m    p_m             m    p_m
  10   3.709502e-002   60   1.617500e-005
  20   6.997822e-003   70   3.710000e-006
  30   1.503110e-003   80   7.680000e-007
  40   3.320130e-004   90   1.560000e-007
  50   7.368103e-005   100  3.996866e-008

Table 7.1: Estimates of p_m for different m using importance sampling.
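To make the generation of the IPS samples \(\hat p_i\) concrete, here is a minimal multilevel splitting sketch for this random walk, written for illustration only. It is a simplified variant of an interacting particle scheme and is not claimed to reproduce the exact algorithm analyzed in this thesis; the function names (`step`, `ips_estimate`) are ours. At each of the m stages, every particle runs until it reaches the next threshold or is absorbed at −1, the surviving fraction is recorded, and the survivors are resampled uniformly back up to N particles; the product of the surviving fractions is one sample \(\hat p_i\).

```python
import random

def step(s, rng):
    """One increment of the walk: -1, 1, 2 w.p. 0.6, 0.3, 0.1."""
    u = rng.random()
    if u < 0.6:
        return s - 1
    elif u < 0.9:
        return s + 1
    else:
        return s + 2

def ips_estimate(m, n_particles, rng):
    """One IPS sample of p_m = P{walk reaches m before -1}.

    Simplified multilevel splitting: at stage `level`, each particle
    runs until it reaches `level` or dies at -1; survivors are
    resampled uniformly (with replacement) back to n_particles.
    """
    particles = [0] * n_particles
    estimate = 1.0
    for level in range(1, m + 1):
        survivors = []
        for s in particles:
            while -1 < s < level:
                s = step(s, rng)
            if s >= level:
                survivors.append(s)
        estimate *= len(survivors) / n_particles
        if not survivors:
            return 0.0  # all particles died; this sample is 0
        particles = [survivors[rng.randrange(len(survivors))]
                     for _ in range(n_particles)]
    return estimate

if __name__ == "__main__":
    rng = random.Random(2012)
    samples = [ips_estimate(10, 200, rng) for _ in range(20)]
    # rough average; compare with p_10 ≈ 3.71e-2 from Table 7.1
    print(sum(samples) / len(samples))
```

Because the product-of-fractions estimator is unbiased for this kind of branching scheme, averaging even a modest number of such samples lands near the Table 7.1 value for m = 10.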
The Large Deviation Rate of the Second Moment

In Figure 7.1, we compare the large deviation rate of the second moment of the IPS estimators for different m and N. In each case, the sample size K is taken to be 10,000. The large deviation rate of the second moment is approximated by computing
\[
-\log\left[ \frac{1}{K}\sum_{i=0}^{K-1} \hat p_i^2 \right].
\]
The black dashed line represents the asymptotically optimal rate \(-2\log p_m\), which is approximately an affine function of m for large m. Here \(p_m\) is evaluated by importance sampling. The theoretical results from the previous chapters indicate that the IPS scheme is asymptotically suboptimal, and that the large deviation rate of its second moment is a function of N. However, as the number of particles N goes to infinity, the IPS scheme becomes asymptotically optimal. As one can see in Figure 7.1, the large deviation rate of IPS converges to the asymptotically optimal rate as N grows, which supports the theoretical results.

Figure 7.1: Large deviation rate of the second moment of the interacting particle system estimator for different m and N (curves for N = 5, 10, 50, 100, 250, 500, 1000, 10000, together with the asymptotically optimal rate).

Trade-off Between N and K

Another interesting question related to the IPS estimator is the trade-off between the number of particles N and the number of samples K. We illustrate the problem in two respects:

1. The empirical standard error:
\[
\sigma_K^N = \sqrt{ \frac{1}{K-1} \sum_{i=0}^{K-1} (\hat p_i - \hat p)^2 },
\]
where \(\hat p\) is the sample mean of the IPS estimates.

2. The per sample standard error:
\[
\sqrt{ \operatorname{var}\left( \frac{1}{K}\sum_{i=0}^{K-1} \hat p_i \right) } = \sqrt{ \frac{\operatorname{var}(\hat p_i)}{K} } \approx \frac{\sigma_K^N}{\sqrt{K}}.
\]

To begin with, we fix the computational effort. That is, we hold the total number of simulated paths constant, i.e. N × K = constant. The constant is taken to be 1,000,000 in our simulations.
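The diagnostics above are straightforward to compute from a batch of IPS samples. The following small sketch (function names are ours) computes the approximate second-moment rate plotted in Figure 7.1, the empirical standard error \(\sigma_K^N\), and the per sample standard error \(\sigma_K^N/\sqrt{K}\):

```python
import math

def second_moment_rate(samples):
    """-log of the empirical second moment (1/K) sum p_i^2."""
    K = len(samples)
    return -math.log(sum(p * p for p in samples) / K)

def empirical_std_error(samples):
    """sigma_K^N: sample standard deviation of the K IPS samples."""
    K = len(samples)
    mean = sum(samples) / K
    return math.sqrt(sum((p - mean) ** 2 for p in samples) / (K - 1))

def per_sample_std_error(samples):
    """Approximate standard error of the averaged estimate, sigma_K^N / sqrt(K)."""
    return empirical_std_error(samples) / math.sqrt(len(samples))
```

For instance, `empirical_std_error([0.03, 0.04, 0.05])` returns 0.01 (up to floating point), and dividing by √3 gives the corresponding per sample standard error.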
Figures 7.2 and 7.3 plot the decay of the empirical standard error as the number of particles N increases, for m = 10 and m = 50 respectively. Note that the sample size K also varies, since we keep the total computational effort fixed. N ranges from 5 to 100 in increments of size 1.

Figure 7.4 compares the changes in the per sample standard error to the changes in N. The total range of N is [5, 10000]. The top left graph plots \(\sigma_K^N/\sqrt{K}\) for N ∈ {5, 6, …, 100}. As one can see, there is a significant reduction in the per sample standard error for N less than 40. In the top right graph, which covers part of the range of the bottom graph, the number of particles ranges from 5 to 1,000 in increments of size 5; it shows that there is no per sample standard error reduction for N greater than 100. The bottom graph shows the overall trend of \(\sigma_K^N/\sqrt{K}\). In conclusion, the computational results suggest that the number of particles N should be chosen to be around 40 for our problem.

Figure 7.2: Trade-off between N and K with fixed effort, N × K = 1,000,000, m = 10.

General Results

In this section, we present some basic results of the IPS scheme under the i.i.d. random walk setting with m = 10 and m = 50. Figures 7.5 and 7.6 plot the sample error
\[
\frac{1}{K}\sum_{i=0}^{K-1} \hat p_i - p_m
\]
for fixed sample size K = 10,000, where \(p_m\) is the true value obtained from importance sampling. The sample error appears to be asymptotically independent of N for large N, which is what we expected.
For fixed sample size K = 10,000, Figures 7.7 and 7.8 show that the empirical standard error decreases as the number of particles N grows.

Figure 7.3: Trade-off between N and K with fixed effort, N × K = 1,000,000, m = 50.

Figure 7.9 plots the empirical standard error for different sample sizes K while N is fixed. As expected, the estimates are roughly constant.

We predict that the rate of decay of the second moment of the IPS scheme in N is of order 1/N. If true, the statistic
\[
\frac{N}{m} \log \frac{ \frac{1}{K}\sum_{i=0}^{K-1} \hat p_i^2 }{ (p_m)^2 } \tag{7.1}
\]
should be roughly constant in N. However, this statistic requires m to be large compared to N. In Figure 7.10 we cannot see such a trend because m is too small. In Figure 7.11, one can see that for N less than 1,000 the quantity (7.1) is roughly constant. However, since the sample error is roughly constant in K, for fixed K the statistic (7.1) amplifies that error by a factor of N. Hence, when N grows large in Figure 7.11, we see increasing oscillation of (7.1). Proving this decay rate in N theoretically is part of our future work.
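The statistic (7.1) is a one-line computation from the IPS samples; a minimal sketch (the function name is ours):

```python
import math

def decay_statistic(samples, N, m, p_m):
    """Statistic (7.1): (N/m) * log( empirical second moment / p_m^2 ).

    If the second moment of the IPS estimator decays in N at order 1/N,
    this quantity should be roughly constant in N (for m large relative to N).
    """
    second_moment = sum(p * p for p in samples) / len(samples)
    return (N / m) * math.log(second_moment / p_m ** 2)
```

Note that any fixed relative error in the samples enters (7.1) multiplied by N/m, which is consistent with the growing oscillation seen in Figure 7.11 for large N.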
Figure 7.4: Trade-off between N and K with fixed effort (per sample standard error), N × K = 1,000,000, m = 10.

Figure 7.5: Sample error, m = 10, fixed K.

Figure 7.6: Sample error, m = 50, fixed K.

Figure 7.7: Standard error, m = 10, fixed K.

Figure 7.8: Standard error, m = 50, fixed K.

Figure 7.9: Standard error, m = 10, fixed N.

Figure 7.10: Decay rate in N, m = 10, fixed K.

Figure 7.11: Decay rate in N, m = 50, fixed K.