Analysis of an Interacting Particle Method for Rare Event Estimation

by Yi Cai
M.S., Brown University, Providence, RI, 2007
B.S., Peking University, Beijing, China, 2006
B.A., Peking University, Beijing, China, 2006

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the Division of Applied Mathematics at Brown University

Providence, Rhode Island
May 2012

Abstract of "Analysis of an Interacting Particle Method for Rare Event Estimation" by Yi Cai, Ph.D., Brown University, May 2012

This thesis is a large deviations study of the performance of an interacting particle method for rare event estimation. The analysis is restricted to a one-dimensional setting, though even in this restricted setting a number of new techniques must be developed. In contrast to the large deviations analysis of related algorithms, for interacting particle schemes it is an occupation measure analysis that is relevant, and within this framework many standard assumptions (stationarity, the Feller property) can no longer be assumed. The methods developed are not limited to the question of performance analysis, and in fact give the full large deviations principle for such systems.

© Copyright 2012 by Yi Cai

This dissertation by Yi Cai is accepted in its present form by the Division of Applied Mathematics as satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date    Paul Dupuis, Ph.D., Advisor

Recommended to the Graduate Council

Date    Hongjie Dong, Ph.D., Reader
Date    Kavita Ramanan, Ph.D., Reader

Approved by the Graduate Council

Date    Peter Weber, Dean of the Graduate School

To My Parents

The Vita of Yi Cai

Yi Cai was born in Nanchang, Jiangxi province, China on January 8, 1985. In 2002, she started her undergraduate study at Peking University, and received both a Bachelor of Science degree in Theoretical Mechanics and a Bachelor of Arts degree in Economics four years later.
She came to the United States in 2006 and entered the Ph.D. program in the Division of Applied Mathematics at Brown University. During the Ph.D. program, she received a Master of Science degree in Applied Mathematics in 2007. This dissertation was defended on April 27, 2012.

Acknowledgments

First and foremost, I would like to express my deepest gratitude to my advisor, Prof. Paul Dupuis. This dissertation would not have been possible without his support, understanding and encouragement. Coming from a non-mathematical background, I started out with little knowledge of large deviations and little experience in scientific research. Prof. Dupuis, with unsurpassed patience, helped me overcome those challenges and pointed me in the right direction. His rigorous attitude towards scientific research has been a tremendous influence on me and shaped me profoundly. I am very fortunate to have him as a mentor.

I would also like to thank my other committee members, Prof. Hongjie Dong and Prof. Kavita Ramanan, for their helpful suggestions and comments during the completion of this dissertation. I am grateful to the Division of Applied Mathematics and the Graduate School of Brown University for generous financial support over the past six years. I also want to thank all of my friends at Brown for helping me in every aspect. My special appreciation goes to Jingjing Wang for her friendship and support, which helped me through many crises. Last but not least, I would like to thank my parents for their unconditional love. They have always been supportive of everything I do, even though they are on the other side of the world.

Contents

Acknowledgments

0 Introduction and Preliminaries
  0.1 Monte Carlo Simulation and Asymptotic Optimality
  0.2 Beyond Monte Carlo Simulation
  0.3 Preliminaries

I Problem Formulation and Main Results

1 Notation and Problem Formulation
  1.1 Process Model and the Event of Interest
  1.2 Interacting Particle System (IPS) Estimator
  1.3 Reformulation as an Occupation Measure Problem
  1.4 First Entrance Distributions and the Feller Property
2 Statement of the Main Results

II Large Deviation Analysis

3 Relative Entropy Representation
  3.1 The Representation Formula
  3.2 The Limiting Behavior of the Underlying Transportation Kernel
4 The Large Deviation Upper Bound
  4.1 Notation and Preliminaries
  4.2 Tightness and Characterization of Limits
  4.3 Proof of the Lower Bound (2.2)
5 The Large Deviation Lower Bound
  5.1 Some Important Assumptions and Ergodicity of the Limit Underlying Transition Kernel
  5.2 The Construction of Ergodic IPS Transition Kernels
  5.3 The Construction of Ergodic Controls
  5.4 Proof of the Upper Bound (2.3)
  5.5 Appendix
6 Asymptotics of the Decay Rate as N → ∞
  6.1 Proof of Theorem 6.1

III Numerical Results

7 Computational Results

Bibliography
Index

List of Figures

7.1 large deviation rate of the second moment of the interacting particle system estimator for different m and N
7.2 trade-off between N and K with fixed effort, N × K = 1,000,000, m = 10
7.3 trade-off between N and K with fixed effort, N × K = 1,000,000, m = 50
7.4 trade-off between N and K with fixed effort, N × K = 1,000,000, m = 10
7.5 sample error, m = 10, fixed K
7.6 sample error, m = 50, fixed K
7.7 standard error, m = 10, fixed K
7.8 standard error, m = 50, fixed K
7.9 standard error, m = 10, fixed N
7.10 decay rate in N, m = 10, fixed K
7.11 decay rate in N, m = 50, fixed K

Chapter 0

Introduction and Preliminaries

0.1 Monte Carlo Simulation and Asymptotic Optimality

A basic and common technique for the estimation of probabilities and functionals of probability measures is standard Monte Carlo simulation. Let X be a random variable taking values in R distributed according to µ ∈ P(R). To estimate the expectation of X, denoted by EX, one can generate K independent and identically distributed (iid) samples from µ, say X_0, X_1, ..., X_{K−1}, and the estimate for EX is just the sample mean

    Q_K = \frac{1}{K} \sum_{i=0}^{K-1} X_i.

It is easy to check that this estimator is unbiased. Furthermore, the Law of Large Numbers (LLN) ensures that this estimate converges to the correct value as the number of samples K goes to ∞. The sample size K can be chosen by some predetermined precision requirement.

However, when the events of interest are of small probability (termed rare events), standard Monte Carlo simulation will require an enormous number of samples for the variance of the resulting estimate to be comparable to the unknown probability.
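The sample-mean estimator just described can be sketched in a few lines. This is a minimal illustration only; the N(3, 1) target distribution and the function name `sample_mean` are hypothetical stand-ins for µ and Q_K.

```python
import random

def sample_mean(sampler, K, seed=0):
    """Standard Monte Carlo: unbiased estimate of EX from K iid samples."""
    rng = random.Random(seed)
    return sum(sampler(rng) for _ in range(K)) / K

# Estimate EX for X ~ N(3, 1); the LLN drives the estimate toward 3 as K grows.
est = sample_mean(lambda rng: rng.gauss(3.0, 1.0), K=200_000, seed=1)
print(est)
```

The standard deviation of the sample mean shrinks like 1/√K, which is adequate here but, as discussed next, is far too slow when the quantity being estimated is itself exponentially small.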
To be precise, we introduce the large deviation methodology. We begin by giving the definitions of a rate function and the large deviation principle. Let X be a complete separable metric space, i.e. a Polish space.

Definition 0.1 (Rate Function). A function I on X is called a rate function on X, or simply a rate function, if I maps X into [0, ∞] and if for each M < ∞ the level set {x ∈ X : I(x) ≤ M} is a compact subset of X.

Let {X^m, m ∈ N} be a sequence of random variables defined on a probability space (Ω, F, P) and taking values in a Polish space X.

Definition 0.2 (Large Deviation Principle). Let I be a rate function on X. The sequence {X^m} is said to satisfy the large deviation principle on X with rate function I if the following two conditions hold.

(a) Large deviation upper bound. For each closed subset F of X

    \limsup_{m \to \infty} \frac{1}{m} \log P\{X^m \in F\} \le -\inf_{x \in F} I(x).

(b) Large deviation lower bound. For each open subset G of X

    \liminf_{m \to \infty} \frac{1}{m} \log P\{X^m \in G\} \ge -\inf_{x \in G} I(x).

With the definition of the large deviation principle in hand, we now give a concrete example of the poor performance of standard Monte Carlo for rare event simulation, keeping in mind that the issues hold in general.

Let Y^0, Y^1, ... be a sequence of iid R^d-valued random variables with distribution µ, and assume the moment generating function \int_{R^d} e^{\langle \alpha, y \rangle} µ(dy) is finite for all α ∈ R^d. Let S^m = Y^0 + Y^1 + ··· + Y^{m−1}. For Borel sets A ⊂ R^d, Cramér's Theorem [6, Theorem 2.2.3] guarantees a large deviation principle for S^m/m. Let p^m = P(S^m/m ∈ A). We also define the indicator function 1_A(x), which equals 1 if x ∈ A and 0 otherwise.

Now let us consider the Monte Carlo approximation of p^m for some fixed value of m. In terms of the notation introduced previously, we have X = 1_A(S^m/m), and straightforward Monte Carlo would require the generation of many iid copies of X (i.e., of S^m). The variance of Q_K is given by

    \mathrm{var}[Q_K] = \mathrm{var}\left[ \frac{1}{K} \sum_{i=0}^{K-1} 1_A\!\left(\frac{S_i^m}{m}\right) \right] = \frac{p^m(1-p^m)}{K}.

If p^m → 0 as m → ∞, this quantity approaches zero. However, the relative error is

    \text{relative error} = \frac{\text{standard deviation of } Q_K}{\text{mean of } Q_K} = \frac{\sqrt{p^m(1-p^m)}}{\sqrt{K}\, p^m}.

Since \sqrt{p^m(1-p^m)}/p^m → ∞, a large sample size (i.e., K) is required for the estimator to achieve a reasonable relative error bound (at least when m is large). In fact, since p^m scales according to a large deviation principle, K must grow exponentially in m if a bounded relative error is to be maintained. As a result, standard Monte Carlo simulation is rarely used to estimate rare event probabilities, and more efficient alternatives are explored.

Before introducing other more "efficient" schemes, a natural question to ask is: what is a mathematical criterion for "efficient"? The above discussion suggests an answer: for fixed m, the faster the variance decays, the smaller the required sample size K. We define those schemes that achieve the "fastest" rate of convergence as asymptotically optimal. The formal mathematical definition is formulated as follows.

Asymptotic Optimality

Assume the random variables {X^m, m = 0, 1, ...} satisfy a large deviation limit

    \lim_{m \to \infty} \frac{1}{m} \log E X^m = -\rho

for some ρ ≥ 0. In order to estimate EX^m, an alternative random variable \bar{X}^m is constructed such that E X^m = E \bar{X}^m. Similarly to standard Monte Carlo, one generates iid replications \bar{X}_0^m, ..., \bar{X}_{K-1}^m of \bar{X}^m and obtains an unbiased estimator by averaging:

    \bar{Q}_K^m = \frac{1}{K} \sum_{i=0}^{K-1} \bar{X}_i^m.

The rate of convergence associated with this estimator is determined by the variance of the summands. Since unbiasedness implies that the first moment is fixed, it suffices to consider the second moment E(\bar{X}_0^m)^2. However, it follows from Jensen's inequality that

    \limsup_{m \to \infty} -\frac{1}{m} \log E\big(\bar{X}_0^m\big)^2 \le \lim_{m \to \infty} -\frac{1}{m} \log \big(E \bar{X}_0^m\big)^2 = 2\rho.    (0.1)

The estimator \bar{Q}_K^m is said to be asymptotically optimal if this upper bound is achieved, i.e. if

    \lim_{m \to \infty} -\frac{1}{m} \log E\big(\bar{X}_0^m\big)^2 = 2\rho.

This thesis studies the performance of one of these more efficient schemes for rare event simulation, namely the interacting particle scheme. The above concept will be the focus of the analysis. We will come back to the definition of asymptotic optimality in Chapter 1.

0.2 Beyond Monte Carlo Simulation

The two most widely considered alternatives to standard Monte Carlo simulation are those based on change-of-measure techniques and those based on branching processes. The schemes this dissertation is interested in belong to the second category.

Importance Sampling Schemes

The first alternative is usually referred to as importance sampling. Suppose that the probability of interest takes the form p = P(X ∈ A) = µ(A), where A is a subset of some reasonably regular space and µ a probability measure. In contrast to ordinary Monte Carlo, which samples directly from µ, importance sampling uses an alternative sampling distribution ν, generates iid samples \bar{X}_i from ν, and then estimates via the sample mean of [dµ/dν](\bar{X}_i) 1_{\{\bar{X}_i \in A\}}. The Radon-Nikodym derivative [dµ/dν](\bar{X}_i) guarantees that the estimate is unbiased. The goal is to choose ν so that the individual samples [dµ/dν](\bar{X}_i) 1_{\{\bar{X}_i \in A\}} cluster tightly around p, thereby reducing the variance. The characterization of the optimal achievable performance of importance sampling schemes in the large deviation limit is discussed in [10, 11] in terms of a deterministic differential game, and [12] suggests a way to construct importance sampling schemes based on subsolutions of the Isaacs equation.

Branching Type Schemes

For branching type schemes, the most discussed application is first entrance probabilities, and to continue the discussion we specialize to that case. Consider a Markov process X = {X_i, i = 0, 1, ...}, and the probability of interest

    p = P\{\tau_A \le \tau_{A \cup B} \,|\, X(0) = x\}

for sets A and B, where \tau_C = \min\{i \ge 0 : X_i \in C\}.
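As a toy instance of this first-entrance probability, the following sketch estimates p by crude Monte Carlo for a hypothetical one-dimensional random walk with negative drift standing in for X, with A = [1, ∞) the rare set and B = (−∞, 0] the typical one. The drift, step size, and function name are all illustrative assumptions, not part of the thesis's model.

```python
import random

def entrance_prob_mc(K, drift=-0.2, step=0.1, seed=0):
    """Crude Monte Carlo for p = P{tau_A <= tau_{A u B}}: count paths of a
    random walk started at 0.5 that enter A = [1, inf) before B = (-inf, 0].
    The walk is an Euler-type discretization with negative drift."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(K):
        x = 0.5
        while 0.0 < x < 1.0:
            x += drift * step + rng.gauss(0.0, 1.0) * step ** 0.5
        if x >= 1.0:
            hits += 1
    return hits / K

print(entrance_prob_mc(K=20_000))
```

With the mild drift chosen here the event is not yet rare, so plain Monte Carlo works; as the drift strengthens or the gap widens, the hit count collapses and the relative-error problem of the previous section reappears.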
We consider just the discrete time case, but note that the continuous time case is also of interest. Suppose that B is a "typical set" and that A is rare, in that with high probability (perhaps very high probability) B is entered prior to A.

When using branching processes, the idea is to branch the process in such a way that the a priori rare outcome becomes more likely for at least some of the descendants (or particles), and then normalize to produce a single unbiased estimate of p. As with ordinary Monte Carlo one can then generate a number of independent copies of the branched process, and use the sample average to estimate p. Since the copies are independent, the performance of the scheme is determined by the variance of a single copy. Key questions in the construction of the branching process are then when to branch (the triggers) and how much to branch (the rates). A natural and standard approach to defining triggers is the use of thresholds, i.e., nested sets {C_j, j = 0, 1, ..., m} with C_0 ⊃ C_1 ⊃ ··· ⊃ C_m ⊃ A, and a rule for branching when these sets are entered. For example, branching of a particular particle could occur the first time either it, or perhaps some selected collection of particles containing it, first reaches a threshold set.

Within a framework where branching is triggered by thresholds there are many possible variants. In what appears to be the oldest class of such methods, particles do not interact, and branching of any particular particle occurs the first time it reaches a threshold that has not been reached before. The number of descendants could be random and could depend on other factors such as the position of the particle when entering the threshold, but it is independent of the state of all other particles at the time of the branching. (If more than one new threshold is entered at the same time, then the distribution of the number of descendants is an appropriate convolution.)
The mean number of descendants is often a constant (an assumption which for purposes of discussion we will assume true), and hence these types of schemes are sometimes called Fixed Rate Branching schemes. Besides the basic fixed branching rate scheme [16, 17, 19], more efficient variants which kill off particles that are of little use while remaining unbiased have been developed [3, 18, 22, 23, 24].

There are many challenges in the design of such schemes. For a large scale problem care must be taken to ensure that the total number of particles, whose distribution obviously depends on the precise placement of the thresholds, is not too large. At the same time, significant reduction of the variance of the estimator requires that enough particles make it all the way to the rare set. The challenge of the design problem is to find the proper balance between these opposing requirements.

In a second class of methods, a particle can be branched only when all particles that started from the previous threshold (say C_{j−1}) have either reached C_j or B. The branching here will not be conditionally independent of the state of all other particles, and in fact the branching is done so as to maintain a fixed and deterministic number (say N) of particles at each threshold. To be precise, suppose that {X_{j−1,n}, n = 1, ..., N} denote the locations of N particles in threshold C_{j−1}. Starting with these locations, particles are independently evolved under the dynamics of X until either C_j or B is reached. Suppose that N_j ≤ N particles reach C_j before B, and let S_j denote the collection of corresponding locations. In order to maintain N particles, we let X_{j,n} for n = 1, ..., N_j be chosen to coincide with the N_j positions in S_j (the order is unimportant), and then construct X_{j,n} for n = N_j + 1, ..., N by independently sampling from S_j according to the uniform distribution. [If N_j = 0 the simulation terminates and zero is returned as the estimate.]
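A single stage of this transport-and-resample mechanism can be sketched as follows. The Brownian-like dynamics with negative drift, the specific levels, and the function name are purely illustrative assumptions; the logic shown is the fixed-effort rule just described: evolve to the next threshold or the absorbing set, keep the survivors, refill to N by uniform resampling.

```python
import random

def one_stage(starts, next_level, B_level, drift=-0.2, step=0.1, rng=None):
    """One fixed-effort stage: evolve each particle independently until it
    enters C_j = [next_level, inf) or the absorbing set B = (-inf, B_level].
    Survivors S_j are kept; the population is refilled to N = len(starts)
    by uniform resampling from S_j.  Returns (new positions X_j, count N_j)."""
    rng = rng or random.Random(0)
    survivors = []
    for x in starts:
        while B_level < x < next_level:
            x += drift * step + rng.gauss(0.0, 1.0) * step ** 0.5
        if x >= next_level:
            survivors.append(x)
    n_j = len(survivors)
    if n_j == 0:
        return [], 0  # population died out: the estimate is zero
    refilled = survivors + [rng.choice(survivors) for _ in range(len(starts) - n_j)]
    return refilled, n_j

positions, n1 = one_stage([0.0] * 100, next_level=0.25, B_level=-1.0)
print(n1, len(positions))
```

Note that every returned position lies in the new threshold set, so the next stage can start from these locations directly.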
Repeating this process over say m thresholds which separate the initial point from A, we get an unbiased estimate

    \hat{p} = \prod_{j=1}^{m} \frac{N_j}{N}    (0.2)

of P\{\tau_A \le \tau_{A \cup B}\}. Methods of this sort are studied in [4, 5, 14, 15], and since the total computational effort (or at least the total number of particles) is a priori known, they are referred to as Fixed Effort Branching methods. Other names used for these sorts of schemes are interacting particle systems (IPS) and genealogical particle methods.

Large deviation analyses of both standard splitting and the variation known as RESTART have been carried out in [2] and [3]. Under the large deviation scaling (and for light-tailed processes), it is natural to let the thresholds be defined as level sets of a function V that is independent of m, and then let the "gaps" between levels scale proportionally to 1/m. Associated to the estimation problem is a partial differential equation [PDE] (specifically, a first order Hamilton-Jacobi-Bellman equation) defined in terms of the sets A and B and the large deviation rate function of X^m. The first conclusion of [2] and [3] is that the total computational effort associated with generating a single sample scales subexponentially with m if and only if V/\log R is a subsolution to this PDE, where R is the mean number of particles produced at each branching. It is well known that p^m decays like e^{-mW(x)}, where W is a weak-sense solution to the same PDE. The second conclusion is that the second moment of the estimator (defined via the thresholds of V and with mean branching rate R) decays like e^{-m[W(x)+V(x)/\log R]}, where x is the starting point X^m(0) = x. Hence, in this rough exponential sense, the standard deviation of the estimator is of the same size as the probability of interest if and only if V(x)/\log R achieves the maximum possible value at x [i.e., W(x)].
When this is not the case, the ratio of the standard deviation to the probability grows exponentially as e^{m[W(x)-V(x)/\log R]/2}. Hence significant improvement over ordinary Monte Carlo [i.e., V(x) = 0] is still possible, and the improvement is quantified by V(x).

In contrast, very little is known regarding the performance of fixed effort schemes in the small probability limit. Analyses to date have studied the variance when the number of particles is first sent to ∞ (so that exchangeability and asymptotically negligible interaction between particles can be exploited), and then the probability is made small. Since fixed effort schemes are at least in part motivated by a desire to control computational effort in a predictable way, this is not an ideal measure of performance. One would prefer to keep the number of particles fixed while letting the probability tend to zero and, as with the fixed rate schemes, compare the standard deviation of the estimator with the probability. It is not surprising that such an analysis is difficult. With standard splitting, any two particles that reach A have a fairly simple dependence. At some time in the past these particles have a last common ancestor. Prior to that the particles have the same history, and after that the evolutions of the two are independent. One can partition particles according to the time and location of the last common ancestor, and in this way exploit this relatively simple dependence. More complex implementations are more difficult to analyze, but the dependence between particles is still limited.

The situation with fixed effort schemes is very different. Here dependence between particles is reintroduced with each resampling, which makes the analysis much more difficult. In the non-interacting case, one can analyze the variance of the estimator using large deviation estimates and techniques related to Freidlin-Wentzell type "small noise" asymptotics.
In the interacting case the relevant techniques are related to those used to analyze the empirical or occupation measure of a Markov process, though there are numerous difficulties that make the analysis significantly more technical and complex. Among the difficult features, besides the recoupling mentioned previously, are: (i) the natural time scale for the process is not the natural time scale for the algorithm, which is in fact indexed by thresholds; (ii) the transition kernels associated with threshold crossing typically do not satisfy the Feller property; (iii) the transition kernels also depend on the large deviation parameter m; and (iv) the derived estimates should be valid all the way from a small number of particles (where large deviation effects dominate) to many particles (where law of large numbers effects dominate). In this thesis we consider only the one dimensional problem, which is already somewhat demanding, and leave the multidimensional problem for further study.

An outline of the thesis is as follows. In the next chapter we formulate the problem and the algorithm in precise terms, and state most of the assumptions that will be used. We also discuss the failure of the Feller property mentioned previously, and indicate how this issue will be addressed in the analysis. Chapter 2 gives the statement of the main result, which is the large deviation decay rate for the second moment of the IPS estimator for an arbitrary number of particles. Chapter 3 develops a convenient stochastic control representation for this second moment. Chapter 4 proves tightness of measures and processes appearing in the representation, characterizes weak limits, and proves the large deviation upper bound. Ergodicity properties for controlled processes are discussed in Chapter 5, after which the large deviation lower bound is proved. Chapter 6 studies the decay rate as a function of the number of particles.
Among the conclusions of this section are that for any fixed N the IPS method has a suboptimal rate of convergence of the second moment to zero, with a well defined "gap." However, another conclusion is that the IPS method is also less sensitive to design issues than competing schemes, at least in the one dimensional setting. The issue is less clear cut for multidimensional problems.

0.3 Preliminaries

In this section, we state some useful concepts and theorems which will be needed throughout this thesis. We begin by giving the definition of the Laplace principle, which can be seen as an alternative definition of the large deviation principle (Definition 0.2). See [9, Section 1.2] for a detailed discussion. In this dissertation, we will use this definition to prove the large deviation property of the second moment of the IPS estimator.

Definition 0.3 (Laplace Principle). Let I be a rate function on X. The sequence {X^m} is said to satisfy the Laplace principle on X with rate function I if for all bounded continuous functions h

    \lim_{m \to \infty} \frac{1}{m} \log E\{\exp[-m h(X^m)]\} = -\inf_{x \in \mathcal{X}} \{h(x) + I(x)\}.

The term Laplace principle upper bound refers to the validity of

    \limsup_{m \to \infty} \frac{1}{m} \log E\{\exp[-m h(X^m)]\} \le -\inf_{x \in \mathcal{X}} \{h(x) + I(x)\}

for all bounded continuous functions h, while the term Laplace principle lower bound refers to the validity of

    \liminf_{m \to \infty} \frac{1}{m} \log E\{\exp[-m h(X^m)]\} \ge -\inf_{x \in \mathcal{X}} \{h(x) + I(x)\}

for all bounded continuous functions h.

Another important concept is relative entropy, which will play a key role in the weak convergence approach to large deviations via the variational formula given in part (a) of Proposition 0.5. The derivation of this formula requires a measurable space (S, A). We denote by P(S) the set of probability measures on (S, A).

Definition 0.4 (Relative Entropy). Given γ, θ ∈ P(S), the relative entropy of γ with respect to θ is defined by

    R(\gamma \| \theta) = \int_S \log\left[\frac{d\gamma}{d\theta}(x)\right] \gamma(dx)

if γ is absolutely continuous with respect to θ, i.e. γ ≪ θ, and \log([d\gamma/d\theta](x)) is integrable with respect to γ, and by ∞ otherwise.

The next proposition [9, Proposition 1.4.2], which states the variational formula, is a cornerstone of the weak convergence approach.

Proposition 0.5. Let (S, A) be a measurable space, k a bounded measurable function mapping S into R, and θ a probability measure on S. The following conclusions hold.

(a) We have the variational formula

    -\log \int_S e^{-k} \, d\theta = \inf_{\gamma \in \mathcal{P}(S)} \left[ R(\gamma \| \theta) + \int_S k \, d\gamma \right].    (0.3)

(b) Let γ⁰ denote the probability measure on S which is absolutely continuous with respect to θ and satisfies

    \frac{d\gamma^0}{d\theta}(x) = e^{-k(x)} \cdot \frac{1}{\int_S e^{-k} \, d\theta}.

Then the infimum in the variational formula (0.3) is attained uniquely at γ⁰.

An extended version of Proposition 0.5 will also be needed, given in [9, Proposition 4.5.1].

Proposition 0.6. Let S be a Polish space, k a measurable function mapping S into R which is either bounded from below or bounded from above, and θ a probability measure on S. Then we have the variational formula

    -\log \int_S e^{-k} \, d\theta = \inf_{\gamma \in \Delta(S)} \left[ R(\gamma \| \theta) + \int_S k \, d\gamma \right],

where \Delta(S) = \{\gamma \in \mathcal{P}(S) : R(\gamma \| \theta) < \infty\}.

The next lemma and theorem [9, Lemma 1.4.3, Theorem B.2.1] establish some properties of relative entropy: convexity, lower semicontinuity, compactness of level sets, uniform integrability of sequences of measures satisfying a suitable relative entropy bound, factorization, approximation by sums, the chain rule, etc.

Let X be a Polish space. We denote by C_b(X) the space of bounded continuous functions mapping X into R and by Ψ_b(X) the space of bounded measurable functions mapping X into R.

Lemma 0.7. Let X and Y be Polish spaces. The relative entropy R(·‖·) has the following properties.

(a) (Donsker-Varadhan variational formula.) For each γ and θ in P(X)

    R(\gamma \| \theta) = \sup_{g \in C_b(\mathcal{X})} \left[ \int_{\mathcal{X}} g \, d\gamma - \log \int_{\mathcal{X}} e^{g} \, d\theta \right]
                        = \sup_{\psi \in \Psi_b(\mathcal{X})} \left[ \int_{\mathcal{X}} \psi \, d\gamma - \log \int_{\mathcal{X}} e^{\psi} \, d\theta \right].    (0.4)

(b) R(γ‖θ) is a convex, lower semicontinuous function of (γ, θ) ∈ P(X) × P(X).
In particular, R(γ‖θ) is a convex, lower semicontinuous function of each variable γ or θ separately. In addition, for fixed θ ∈ P(X), R(·‖θ) is strictly convex on the set {γ ∈ P(X) : R(γ‖θ) < ∞}.

(c) For each θ ∈ P(X), R(·‖θ) has compact level sets. That is, for each M < ∞ the set {γ ∈ P(X) : R(γ‖θ) ≤ M} is a compact subset of P(X).

(d) Let X = R^d and let {γ_n, n ∈ N} and {θ_n, n ∈ N} be sequences in P(R^d). Assume that for each α ∈ R^d

    \sup_{n \in \mathbb{N}} \int_{\mathbb{R}^d} e^{\langle \alpha, y \rangle} \, \theta_n(dy) < \infty \quad \text{and} \quad \sup_{n \in \mathbb{N}} R(\gamma_n \| \theta_n) < \infty.

Then {γ_n} is both tight and uniformly integrable in the sense that

    \lim_{C \to \infty} \sup_{n \in \mathbb{N}} \int_{\{y \in \mathbb{R}^d : \|y\| > C\}} \|y\| \, \gamma_n(dy) = 0.

(e) Let X = R^d and let γ and θ be probability measures on R^d. Assume that for each α ∈ R^d

    \int_{\mathbb{R}^d} e^{\langle \alpha, y \rangle} \, \theta(dy) < \infty

and that R(γ‖θ) < ∞. Then \int_{\mathbb{R}^d} \|y\| \, \gamma(dy) < \infty.

(f) Let σ(dy|x) and τ(dy|x) be stochastic kernels on Y given X and θ a probability measure on X. Then the function mapping x ∈ X ↦ R(σ(·|x)‖τ(·|x)) is measurable and

    \int_{\mathcal{X}} R(\sigma(\cdot|x) \| \tau(\cdot|x)) \, \theta(dx) = R(\theta \otimes \sigma \| \theta \otimes \tau).

(g) We denote by Π the class of all finite measurable partitions of X. Then for each γ and θ in P(X)

    R(\gamma \| \theta) = \sup_{\pi \in \Pi} \sum_{A \in \pi} \gamma(A) \log \frac{\gamma(A)}{\theta(A)},

where the summand equals 0 if γ(A) = 0 and equals ∞ if γ(A) > 0 and θ(A) = 0. In addition, if A is any Borel subset of X, then

    R(\gamma \| \theta) \ge \gamma(A) \log \frac{\gamma(A)}{\theta(A)} - 1.

Let X and Y be Polish spaces and let τ(dy|x) be a family of probability measures on Y parameterized by x ∈ X. We call τ(dy|x) a stochastic kernel on Y given X if for every Borel subset E of Y the function mapping x ∈ X ↦ τ(E|x) ∈ [0, 1] is measurable.

Theorem 0.8 (Chain Rule). Let X and Y be Polish spaces and α and β probability measures on X × Y. We denote by α_1 and β_1 the first marginals of α and β and by α(dy|x) and β(dy|x) the stochastic kernels on Y given X for which we have the decompositions

    \alpha(dx \times dy) = \alpha_1(dx) \otimes \alpha(dy|x) \quad \text{and} \quad \beta(dx \times dy) = \beta_1(dx) \otimes \beta(dy|x).
Then the function mapping x ∈ X ↦ R(α(·|x)‖β(·|x)) is measurable and

    R(\alpha \| \beta) = R(\alpha_1 \| \beta_1) + \int_{\mathcal{X}} R(\alpha(\cdot|x) \| \beta(\cdot|x)) \, \alpha_1(dx).

The next theorem [9, Theorem A.5.6] states the decomposition of stochastic kernels.

Theorem 0.9 (Decomposition Theorem). Let σ(dx × dy|z) be a stochastic kernel on X × Y given S and θ a probability measure on (S, A). We denote by σ_1(dx|z) the first marginal of σ(dx × dy|z), which is the stochastic kernel on X given S defined by σ_1(A|z) = σ(A × Y|z) for Borel sets A and z ∈ S. Then there exists a stochastic kernel σ(dy|x, z) on Y given X × S such that θ-a.s. for z ∈ S

    \sigma(A \times B|z) = \int_{A \times B} \sigma_1(dx|z) \, \sigma(dy|x, z) = \int_A \sigma(B|x, z) \, \sigma_1(dx|z)

for all Borel subsets A of X and B of Y. This decomposition is summarized as σ(dx × dy|z) = σ_1(dx|z) ⊗ σ(dy|x, z).

Let p(x, dy) be a transition probability function defined on a Polish space S. The Feller property ([9, Condition 8.3.1]) is defined as follows.

Definition 0.10 (Feller Property). The function mapping x ∈ S ↦ p(x, ·) ∈ P(S) is continuous in the topology of weak convergence on P(S); i.e., if {x_n, n ∈ N} is any sequence in S such that x_n → x ∈ S, then p(x_n, ·) ⇒ p(x, ·).

Part I

Problem Formulation and Main Results

Chapter 1

Notation and Problem Formulation

In this chapter we formulate the problem of interest, which is the analysis via large deviations of the performance of an interacting particle system (IPS) for rare event estimation. We begin by identifying the event of interest and the related large deviation asymptotics in the next section. In the following section the estimator is constructed and standard performance criteria for rare event estimation are reviewed. In the last section we show how these criteria are reformulated in terms of the occupation measure of a related stochastic process.
1.1 Process Model and the Event of Interest

In this thesis, we restrict discussion to a one dimensional process, but within that framework allow a state dependent discrete time Markov model. The process model is constructed as follows. Let {θ_i(x), i ∈ N_0, x ∈ R} be a sequence of independent and identically distributed (iid) random fields with the common distribution

    P\{\theta_i(x) \in dy\} = \mu(dy|x).    (1.1)

For m ∈ N, define a Markov chain by setting Z_0^m = 0 and recursively defining

    Z_{i+1}^m = Z_i^m + \frac{1}{m} \theta_i(Z_i^m).    (1.2)

Here m is the large deviation parameter, and we are interested in events that become rare as m → ∞. We also consider a continuous time interpolant, which is defined by Z^m(t) = Z_i^m if t ∈ [i/m, i/m + 1/m). We consider Z^m(·) to take values in the standard Skorokhod space. Conditions 1.1 and 1.3 introduced below will be assumed throughout the paper. Condition 1.2 will be replaced by the strictly stronger Condition 1.4 when discussing the performance of the IPS scheme.

Condition 1.1 (Stability). \sup_{x \in \mathbb{R}} E\theta_i(x) < 0.

The event of interest is then

    p^m = P\{Z_{\sigma^m}^m \ge 1 \,|\, Z_0^m = 0\},    (1.3)

where \sigma^m = \min\{i \ge 0 : Z_i^m \in [1, \infty) \cup (-\infty, -1/m]\}. Although the methods we develop below can be used to analyze a variety of problem formulations, the event studied here is common in the rare event literature for the analysis of algorithms.

To identify the rate of decay via large deviations, we need some standard notation and conditions. For the event of interest it is only the definition of µ(dy|x) on [−1/m, 1] that matters, and so we pose conditions henceforth only on x ∈ [−1, 1]. For each such x and α ∈ R, let

    H(x, \alpha) = \log \int_{\mathbb{R}} e^{\alpha y} \mu(dy|x).    (1.4)

For a Polish space S, let P(S) denote the set of probability measures on S, equipped with the topology of weak convergence.

Condition 1.2. For each α ∈ R, \sup_{x \in [-1,1]} H(x, \alpha) < \infty, and the mapping x ∈ R ↦ µ(·|x) ∈ P(R) is continuous in the topology of weak convergence on P(R).
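As a concrete instance of the model (1.2), the following sketch simulates Z^m until it exits via [1, ∞) or (−∞, −1/m]. The state-independent Gaussian increment law µ(dy|x) = N(−0.5, 1) is a hypothetical choice made only for illustration; it satisfies Condition 1.1 since Eθ_i(x) = −0.5 < 0.

```python
import random

def run_chain(m, seed=0):
    """Simulate Z^m from (1.2) until it exits via [1, inf) or (-inf, -1/m].
    Hypothetical increment law mu(dy|x) = N(-0.5, 1), so sup_x E theta_i(x) < 0
    and Condition 1.1 (stability) holds.  Returns (exit value, number of steps)."""
    rng = random.Random(seed)
    z, i = 0.0, 0
    while -1.0 / m < z < 1.0:
        z += rng.gauss(-0.5, 1.0) / m
        i += 1
    return z, i

z_exit, steps = run_chain(m=50, seed=7)
print(z_exit, steps)  # with negative drift the chain typically exits below
```

Since the absorbing set (−∞, −1/m] lies only 1/m below the starting point while the rare set lies a full unit above, a typical run exits downward after very few steps, which is exactly why {Z^m reaches 1} becomes rare as m grows.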
The following properties are well known under the last condition [9]. Let $L(x, \cdot)$ be the Legendre transform of $H(x, \cdot)$, i.e., for $\beta \in \mathbb{R}$
\[ L(x, \beta) = \sup_{\alpha \in \mathbb{R}} \left[ \alpha\beta - H(x, \alpha) \right]. \]
Then $L(x, \beta) \ge 0$, and $L(x, \beta) = 0$ if and only if $\beta = E\theta_i(x)$. Condition 1.2 is sufficient for $Z^m$ to satisfy a large deviation upper bound on path space. Various sets of conditions suffice for the corresponding lower bound, and the particular condition that is assumed below is sufficient for the lower bound and the existence of a large deviation limit for $p^m$. See [9] for alternatives.

Condition 1.3. The interior of the convex hull of the support of $\mu(\cdot|x)$ is independent of $x$ and contains 0.

Under all three conditions we can assert that
\[ \lim_{m \to \infty} \frac{1}{m} \log p^m = \lim_{m \to \infty} \frac{1}{m} \log P\{Z_{\sigma^m}^m \ge 1 \,|\, Z_0^m = 0\} = -\rho, \tag{1.5} \]
where
\[ \rho = \inf\left\{ \int_0^\tau L(\phi(s), \dot{\phi}(s))\,ds : \phi(0) = 0,\ \phi(\tau) = 1,\ \phi(s) \in (0, 1) \text{ for } s \in (0, \tau) \right\} \in (0, \infty), \]
and where the integral is taken to be $\infty$ if $\phi$ is not absolutely continuous.

1.2 Interacting Particle System (IPS) Estimator

While the large deviation asymptotics described in the last section give an estimate of the probability for any $m$, this estimate can be off significantly for values of $m$ that are of interest. For this reason there is much interest in improved approximations, and in particular in higher accuracy Monte Carlo approximations.

The construction of an IPS estimator for $p^m$ can be described as follows. We introduce a sequence of sets (often defined in higher dimensions via level sets) of the form
\[ C_m^m \subset C_{m-1}^m \subset \cdots \subset C_0^m. \]
In the present one-dimensional setting we will use $C_j^m = \{x : x \ge j/m\}$. (An interesting feature of the IPS scheme that will be proved is that, in the one-dimensional setting, it is relatively insensitive to this choice.) The construction of the estimate is recursive and indexed by stages, which are identified with the entrance into a corresponding set (we will use the terms stage and threshold interchangeably).
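Before detailing the stages, note that for a simple increment law the quantities $H$, $L$ and $\rho$ from (1.4) and (1.5) can be evaluated numerically. The sketch below is ours and assumes a hypothetical Gaussian law $\mu = N(\text{mean}, \text{var})$, for which $H(\alpha) = \text{mean}\,\alpha + \tfrac{1}{2}\text{var}\,\alpha^2$ and $L(\beta) = (\beta - \text{mean})^2/(2\,\text{var})$; since $L$ here does not depend on the state, it suffices for this illustration to minimize over constant-speed paths $\phi(s) = s/\tau$.

```python
ALPHAS = [i / 1000.0 for i in range(-2000, 2001)]   # grid for the sup over alpha

def H(alpha, mean=-0.1, var=1.0):
    # log moment generating function of the hypothetical increment law N(mean, var)
    return mean * alpha + 0.5 * var * alpha * alpha

def legendre(beta, mean=-0.1, var=1.0):
    # L(beta) = sup_alpha [alpha*beta - H(alpha)]; the grid search recovers the
    # Gaussian closed form (beta - mean)^2 / (2 var)
    return max(a * beta - H(a, mean, var) for a in ALPHAS)

def rho(mean=-0.1, var=1.0):
    # variational problem for the decay rate, restricted to constant-speed paths
    # phi(s) = s/tau, for which the cost is tau * L(1/tau)
    return min(t * legendre(1.0 / t, mean, var) for t in [i / 10.0 for i in range(1, 301)])
```

For mean $= -0.1$ and var $= 1$ the minimum is attained at $\tau = 10$ and gives $\rho = 0.2$.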
The construction at each stage is partitioned into transportation and resampling aspects. For particles $n = 1, \ldots, N$ and thresholds $j = 0, \ldots, m-1$, define stopping times $\sigma_{j,n}^m$ and stopped processes recursively as follows. Let $\{\theta_{j,n,i}(x), x \in [-1, 1]\}$ be random fields, which are independent and identically distributed for distinct values of $j \in \{0, \ldots, m-1\}$, $n \in \{1, \ldots, N\}$ and $i \in \mathbb{N}_0$.

Stage $j = 0$. Simulate $N$ copies of (1.2), i.e., define
\[ Z_{0,n,i+1}^m = Z_{0,n,i}^m + \frac{1}{m}\theta_{0,n,i}(Z_{0,n,i}^m), \quad Z_{0,n,0}^m = 0, \]
for $n = 1, \ldots, N$, and let
\[ \sigma_{0,n}^m = \min\{i \ge 0 : Z_{0,n,i}^m \in C_1^m \cup (-\infty, -1/m]\} \]
be the stopping time of the $n$th particle hitting the next stage $C_1^m$, or being absorbed by $(-\infty, -1/m]$.

Stages $j = 1, \ldots, m-1$. Let $Y_{j,n}^m = Z_{j-1,n,\sigma_{j-1,n}^m}^m$, i.e., the location of the $n$th particle when it first enters $C_j^m$ or $(-\infty, -1/m]$. Let $S_j^m = \{Y_{j,n}^m : Y_{j,n}^m \in C_j^m\}$ and let $N_j^m \le N$ be the number of elements of $S_j^m$. Set $X_{j,n}^m = Y_{j,n}^m$ if $Y_{j,n}^m \in C_j^m$, and choose $X_{j,n}^m$ for indices $n$ not corresponding to an element of $S_j^m$ uniformly from $S_j^m$. Define processes $Z_{j,n,i}^m$ via $Z_{j,n,0}^m = X_{j,n}^m$ and
\[ Z_{j,n,i+1}^m = Z_{j,n,i}^m + \frac{1}{m}\theta_{j,n,i}(Z_{j,n,i}^m) \]
for $n = 1, \ldots, N$, and again let
\[ \sigma_{j,n}^m = \min\{i \ge 0 : Z_{j,n,i}^m \in C_{j+1}^m \cup (-\infty, -1/m]\}. \]

In general, the particles are thought of as evolving according to a two-step, genetic type mechanism, though we eschew the associated terminology here:
\[ (X_{j,1}^m, \ldots, X_{j,N}^m) \xrightarrow{\text{transportation}} (Y_{j+1,1}^m, \ldots, Y_{j+1,N}^m) \xrightarrow{\text{resampling}} (X_{j+1,1}^m, \ldots, X_{j+1,N}^m). \]
Since each particle is distributed according to the first entrance distribution into threshold $j$, $N_j^m/N$ is an unbiased estimate of
\[ P\{Z^m \text{ reaches threshold } j \,|\, Z^m \text{ reached threshold } j-1\}. \]
Thus,
\[ \hat{p}^m = \prod_{j=1}^m \frac{N_j^m}{N} \tag{1.6} \]
is in fact an unbiased estimate of
\[ p^m = \prod_{j=1}^m P\{Z^m \text{ reaches threshold } j \,|\, Z^m \text{ reached threshold } j-1\}. \]
The standard measure of performance in these circumstances is the variance of $\hat{p}^m$, which is to be minimized.
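The staged construction can be sketched in a few lines. The following minimal implementation is ours (names `ips_estimate` and `transport` are hypothetical) and again assumes the state-independent Gaussian increment law used above; it returns one sample of $\hat{p}^m$.

```python
import random

def ips_estimate(m, N, mean=-0.1, std=1.0, seed=1):
    """One sample of p_hat^m = prod_j N_j^m / N with thresholds C_j^m = [j/m, inf).

    Hypothetical model data: theta ~ N(mean, std^2), independent of the state."""
    rng = random.Random(seed)

    def transport(z, j):
        # run a particle until it enters C_{j+1}^m or is absorbed in (-inf, -1/m];
        # a particle already past the next threshold is left where it is
        while -1.0 / m < z < (j + 1) / m:
            z += rng.gauss(mean, std) / m
        return z

    estimate = 1.0
    particles = [0.0] * N            # stage 0 starts all particles at the origin
    for j in range(m):
        arrived = [transport(z, j) for z in particles]
        survivors = [y for y in arrived if y >= (j + 1) / m]   # the set S_{j+1}^m
        estimate *= len(survivors) / N                          # factor N_{j+1}^m / N
        if not survivors:
            return 0.0               # extinction: the whole product is zero
        # resampling: keep the survivors, redraw the rest uniformly from them
        particles = survivors + [rng.choice(survivors)
                                 for _ in range(N - len(survivors))]
    return estimate
```

Averaging over independent seeds gives a Monte Carlo approximation of $E\hat{p}^m = p^m$.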
Since $E\hat{p}^m = p^m$, minimization of the variance is equivalent to minimization of the second moment $E(\hat{p}^m)^2$. Recalling the discussion in Chapter 0, we say the estimator $\hat{p}^m$ is asymptotically optimal if the upper bound is achieved, i.e., if
\[ \lim_{m \to \infty} -\frac{1}{m} \log E(\hat{p}^m)^2 = 2\rho. \]
Of course, besides performance one must consider the computational cost to produce each sample $\hat{p}^m$. However, (0.1) provides a bound on the best possible performance.

1.3 Reformulation as an Occupation Measure Problem

The IPS approach to rare event estimation requires a very different set of techniques for its analysis than approaches based on branching processes that use a pre-determined branching rate. In particular, the statistical coupling of particles that occurs at each resampling stage prevents direct use of the "small noise" large deviation properties of the underlying distribution of $Z^m$. Instead, large deviation asymptotics for the occupation measures associated with the particles that make it to the next threshold are relevant. Indeed, if
\[ L^m = \frac{1}{m} \sum_{j=1}^m \delta_{N_j^m}, \tag{1.7} \]
then the performance measure
\[ -\frac{1}{m} \log E(\hat{p}^m)^2 = -\frac{1}{m} \log E \prod_{j=1}^m \left( \frac{N_j^m}{N} \right)^2 \]
can be rewritten in the form
\[ \hat{W}_N^m = -\frac{1}{m} \log E\left[ e^{2 \sum_{j=1}^m \log(N_j^m/N)} \right] = -\frac{1}{m} \log E\left[ e^{2m \int_{\mathbb{R}} \log(n/N)\, L^m(dn)} \right]. \tag{1.8} \]
This exhibits the performance measure as a Laplace functional of the occupation measure $L^m$. However, for many reasons the analysis of $L^m$ is nonstandard, including the following.

1. The "transition kernel" for $N_j^m$ is only implicitly defined in terms of the basic problem data [i.e., the process dynamics $\mu(dy|x)$], and involves a two-stage (transportation and resampling) construction.

2. The transition kernel will depend on $x$ through $\mu(dy|x)$. However, $x$ here is identified with the stage of the construction, and so plays the role of a "time" variable and not a state variable.

In the following chapters we carry out a full large deviations analysis of $\hat{W}_N^m$ in the limit $m \to \infty$.
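The algebra behind the rewriting in (1.8) is elementary and can be checked directly: for any realized counts $N_1^m, \ldots, N_m^m$, the product form of $(\hat{p}^m)^2$ and the Laplace-functional form written against $L^m$ agree exactly. A small sketch (helper names ours):

```python
import math

def w_from_product(counts, N):
    # -(1/m) log prod_j (N_j/N)^2, straight from the definition of p_hat^m
    m = len(counts)
    return -(1.0 / m) * math.log(math.prod((n / N) ** 2 for n in counts))

def w_from_occupation(counts, N):
    # the same quantity as a Laplace functional of L^m = (1/m) sum_j delta_{N_j}:
    # -(1/m) log exp(2m * integral of log(n/N) against L^m(dn))
    m = len(counts)
    integral = sum(math.log(n / N) for n in counts) / m
    return -(1.0 / m) * math.log(math.exp(2 * m * integral))
```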
However, the resulting rate will depend in a nontrivial way on the fixed number of particles $N$. As a consequence, the dependence of the rate constant on the number of particles should also be analyzed.

1.4 First Entrance Distributions and the Feller Property

As discussed in the last section, the large deviation properties of the IPS estimator are intimately connected with the large deviation properties of a related empirical measure on the number of particles making it to the next threshold (success particles). The number of success particles is not by itself Markovian, and a transition kernel on first entrance distributions plays an essential role in its analysis. In this section the transition kernel that arises in the limit will be defined, and properties that are needed for a large deviation analysis will be discussed.

Anticipating a change in notation later on, we replace $x$ by $t$, and recall that for $t \in [0, 1]$, $\theta_i(t)$ denotes a sequence of iid random variables with distribution $\mu(dy|t)$. Define $Z_i(t)$ by $Z_{i+1}(t) = Z_i(t) + \theta_i(t)$ and set $\sigma(t) = \min\{i \ge 0 : Z_i(t) \ge 1\}$. Note that $t$ here plays the role of a parameter. We next define a family of stochastic kernels indexed by $t$. The kernels will be measures on $[0, \infty) \cup \{\Delta\}$, where $\Delta$ corresponds to the event that $Z_i(t)$ drifts to $-\infty$ before exceeding 1, and thereby enters a "cemetery" state. For $v \in \mathbb{R}$ and $A \subset [0, \infty) \cup \{\Delta\}$, set
\[ \gamma(A|v, t) = \begin{cases} P\{Z_{\sigma(t)}(t) - 1 \in A,\ \sigma(t) < \infty \,|\, Z_0(t) = v\} & v \in (-\infty, 1),\ A \subset [0, \infty), \\ P\{\sigma(t) = \infty \,|\, Z_0(t) = v\} & v \in (-\infty, 1),\ A = \{\Delta\}, \\ \delta_{v-1}(A) & v \in [1, \infty),\ A \subset [0, \infty), \end{cases} \tag{1.9} \]
where $\delta_a$ denotes the Dirac measure at $a$. For each fixed $t$, $v \to \gamma(\cdot|v, t)$ is a transition kernel on $[0, \infty) \cup \{\Delta\}$ that reflects the distributional properties of the "overshoot," i.e., the location of the process when it first enters the next threshold given that the current overshoot is $v \ne \Delta$, when the "local" characteristics of the increments in the underlying process are determined by $\mu(dy|t)$.
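The kernel $\gamma(\cdot|v, t)$ can be sampled by simulation. The sketch below is ours: it fixes a hypothetical local law $\mu(\cdot|t) = N(-0.1, 1)$, uses a large step cap as a finite proxy for the event $\{\sigma(t) = \infty\}$ (returning the string "Delta" for the cemetery state), and reproduces the Dirac case $\delta_{v-1}$ when $v \ge 1$.

```python
import random

def gamma_sample(v, mean=-0.1, std=1.0, seed=None, max_steps=10000):
    """One draw from the first-entrance kernel gamma(.|v, t).

    Hypothetical local law mu(.|t) = N(mean, std^2); max_steps is a finite proxy
    for {sigma(t) = infinity}, in which case the cemetery state is returned."""
    rng = random.Random(seed)
    z = v
    for _ in range(max_steps):
        if z >= 1.0:
            return z - 1.0     # the overshoot Z_sigma - 1 (Dirac case if v >= 1)
        z += rng.gauss(mean, std)
    return "Delta"
```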
It is the local transition kernel that will be relevant in the IPS scheme in the limit $m \to \infty$, in a sense that will be made precise in Lemma 3.2.

Before proceeding we review some issues concerning large deviations for the occupation measures of a Markov chain. The first papers on this topic were [7, 8], which were subsequently generalized in many directions. One of the key assumptions in [7] and most other work on the problem is that the transition kernel satisfy the Feller property as in Definition 0.10. In the present context this would be the requirement that for each $t \in [0, 1]$, the map $v \to \gamma(\cdot|v, t)$ is continuous in the topology of weak convergence for $v \in [0, \infty) \cup \{\Delta\}$.

While such an assumption is reasonable in many circumstances, it is restrictive in the current setting, since $\gamma$ is defined by a first entrance distribution. The Feller property can be weakened, as in [9, Section 9.2], which requires that the values of $v$ where it fails are in some sense "negligible," so that one is still in an essentially Feller type framework. In fact, one obtains a rate function that is of exactly the same form as in the Feller case, and we will follow a similar approach here. However, for other, less structured weakenings of the Feller property, the resulting processes can display unusual large deviation properties with very non-standard forms for the rate function (when such processes can be analyzed at all). Indeed, many strange behaviors can occur when the dynamics of the chain allow the process to spend significant time near the discontinuities [13].

A second property that will be needed to establish large deviation bounds is regularity of $\gamma(\cdot|v, t)$ in $t$. Using the Markov property and a standard conditioning argument, one can uniquely characterize $\gamma(\cdot|v, t)$ for $v \in (-\infty, 1)$ and Borel sets $A \subset [0, \infty)$ via the equation [see the derivation of (3.11)]
\[ \gamma(A|v, t) = \int_{\{y : v+y < 1\}} \gamma(A|v+y, t)\,\mu(dy|t) + \mu(A + 1 - v|t). \tag{1.10} \]
It is apparent from this equation that the atoms of $\mu(\cdot|t)$ play an essential role in determining whether a Feller property or a "near" Feller property holds, and that regularity in $t$ will impose further conditions. It should also be noted that even if $\mu(\cdot|t)$ has a density that is as smooth as desired with respect to Lebesgue measure, $\gamma(\cdot|v, t)$ will not satisfy the ordinary Feller property because of the discontinuity at $v = 1$. We will use the following assumption, which is general enough to cover many standard models and simple to verify. Note that this condition implies Condition 1.2.

Condition 1.4. Consider reference probability measures $\nu_d$ and $\nu_c$, each with a finite moment generating function, such that $\nu_c$ is absolutely continuous with respect to Lebesgue measure and there is an integer $q$ such that the support of $\nu_d$ is contained in $\mathbb{Z}/q$. We then assume
\[ \mu(dx|t) = f_d(x, t)\,\nu_d(dx) + f_c(x, t)\,\nu_c(dx), \]
where $f_d(x, t)$ and $f_c(x, t)$ are continuous and uniformly positive and bounded in $x$ and $t$.

Under this condition either the Feller property will hold, or a near Feller property will hold so that the large deviation properties are as if the Feller property were true. The next lemma identifies the places where the Feller property fails, and also shows that continuity from the right always holds.

Lemma 1.5. Assume Conditions 1.1 and 1.4. Then for each $t \in [0, 1]$
\[ \{v \in [0, 1] : v \to \gamma(\cdot|v, t) \text{ is discontinuous}\} \subset (0, 1] \cap [\mathbb{Z}/q], \]
and $\gamma(\cdot|v, t)$ is continuous from the right for all $v$.

Proof. Let $t \in [0, 1]$ be given, and let $\theta_i = \theta_i(t)$ be iid with distribution $\mu(\cdot|t)$ (for simplicity we omit the $t$-dependence of $\theta_i$). For $v \in [0, 1)$, define
\[ R_v \doteq \sum_{i=1}^\infty \left( v + \sum_{k=1}^i \theta_k - 1 \right) 1_{\{v + \sum_{k=1}^\ell \theta_k < 1 \text{ for } \ell = 1, \ldots, i-1\}}\, 1_{\{v + \sum_{k=1}^i \theta_k \ge 1\}}. \]
Then, with the understanding that $v + \sum_{k=1}^\ell \theta_k < 1$ for all $\ell < \infty$ means $R_v = \Delta$, the distribution of $R_v$ is $\gamma(\cdot|v, t)$. Thus for a bounded and continuous function $f : \mathbb{R} \to \mathbb{R}$
\[ \int_{\mathbb{R}} f(x)\,\gamma(dx|v, t) = E\left[ \sum_{i=1}^\infty f\left( v + \sum_{k=1}^i \theta_k - 1 \right) 1_{\{v + \sum_{k=1}^\ell \theta_k < 1 \text{ for } \ell = 1, \ldots, i-1\}}\, 1_{\{v + \sum_{k=1}^i \theta_k \ge 1\}} \right]. \]
Suppose that $v$ is not an element of $(0, 1] \cap [\mathbb{Z}/q]$. Then for each $i$ the indicator functions in the last display are continuous at $v$ w.p.1, and so by the Lebesgue Dominated Convergence Theorem, if $v_j \to v$ as $j \to \infty$ then
\[ \int_{\mathbb{R}} f(x)\,\gamma(dx|v_j, t) \to \int_{\mathbb{R}} f(x)\,\gamma(dx|v, t). \]
The analogous argument applies when $v \in (0, 1] \cap [\mathbb{Z}/q]$ if we restrict to $v_j \downarrow v$, which proves the right continuity.

We will also need the following result, which indicates how the places where the Feller property fails can be avoided. Let $\mu_n \otimes \gamma(A \times B|t) = \int_A \gamma(B|v, t)\,\mu_n(dv)$, and similarly for $\mu \otimes \gamma$.

Lemma 1.6. Fix $t \in [0, 1]$, and assume that
\[ \{v \in [0, 1] : v \to \gamma(\cdot|v, t) \text{ is discontinuous}\} \subset (0, 1] \cap [\mathbb{Z}/q], \]
and that $\gamma(\cdot|v, t)$ is continuous from the right for all $v$. Let $\mu_n \Rightarrow \mu$, where $\mu_n \in \mathcal{P}([0, 1])$. Assume that for all $\varepsilon > 0$ there is $\delta > 0$ such that $\mu_n\left( \cup_{k=1}^q (k/q - \delta, k/q) \right) \le \varepsilon$. Then $\mu_n \otimes \gamma(dv \times dy|t) \Rightarrow \mu \otimes \gamma(dv \times dy|t)$.

Proof. To simplify notation we omit the $t$ dependence. We need to show for bounded and continuous $g$ and $h$ that
\[ \int g(v)h(y)\,(\mu_n \otimes \gamma)(dv \times dy) \to \int g(v)h(y)\,(\mu \otimes \gamma)(dv \times dy). \]
Now $f(v) = \int_{\mathbb{R}} h(y)\,\gamma(dy|v)$ is continuous from the right for all $v$, and from the left save (possibly) those $v$ of the form $k/q$. Let $f^\delta(v)$ be continuous, have the same uniform bound as $f$, and equal $f$ outside $\cup_{k=1}^q (k/q - \delta, k/q)$. Then
\[ \left| \int g(v)\,\mu_n(dv) \int h(y)\,\gamma(dy|v) - \int g(v)\,\mu_n(dv)\,f^\delta(v) \right| \le 2\varepsilon \|h\|_\infty \|g\|_\infty. \]
Since $\mu_n \Rightarrow \mu$ implies $\mu\left( \cup_{k=1}^q (k/q - \delta, k/q) \right) \le \varepsilon$,
\[ \left| \int g(v)\,\mu(dv) \int h(y)\,\gamma(dy|v) - \int g(v)\,\mu(dv)\,f^\delta(v) \right| \le 2\varepsilon \|h\|_\infty \|g\|_\infty, \]
and since $f^\delta$ is continuous, $\mu_n \Rightarrow \mu$ implies
\[ \limsup_{n \to \infty} \left| \int g(v)h(y)\,\mu_n(dv)\,\gamma(dy|v) - \int g(v)h(y)\,\mu(dv)\,\gamma(dy|v) \right| \le 4\varepsilon \|h\|_\infty \|g\|_\infty. \]
The result now follows since $\varepsilon > 0$ is arbitrary.

Chapter 2

Statement of the Main Results

In this chapter we present the main theoretical result, which is the limit of the second moment for the IPS estimator as $m \to \infty$. For n = 0, 1, . . .
, N, let $F^N(n) = -2\log(n/N)$ (with the understanding that $F^N(0) = \infty$), and for $\ell \in \mathcal{P}(\{0, \ldots, N\})$ let
\[ \left\langle F^N, \ell \right\rangle = \sum_{n=0}^N F^N(n)\,\ell(n). \tag{2.1} \]
Recall that $\hat{W}_N^m$ is defined in (1.8) to be $-(1/m)\log E(\hat{p}^m)^2$. It will be evident from the theorem statement that the decay rate for the second moment depends on the number of particles $N$. The dependence of the rate on $N$ is discussed in Chapter 6.

As discussed in Section 1.3, study of the IPS requires the large deviation analysis of an occupation measure problem with a "time" dependent transition kernel. To distinguish a point in the underlying state space of the process from variables that identify stages and hence this time variable, we use $t$ rather than $x$ to correspond to the normalized limits of the stage under the scaling $j/m$. (Its use in this manner motivated the switch from $\mu(dy|x)$ to $\mu(dy|t)$ previously.)

The local rate function $G^N(t, \phi)$ introduced below will be explicitly defined in Section 4.3 [equation (4.22)], after we have discussed the weak limits of certain controlled IPS models that appear in a representation for $\hat{W}_N^m$. The second argument $\phi$ will be a probability measure on $\{1, \ldots, N\}$. The local rate will depend on $t$ since the number of particles which reach the next level and their locations depend on the stage through $\mu(dy|t)$. Thus the large deviation problem involves a probability measure-valued process with non-stationary increments, whose increments are essentially occupation measures that reflect the local properties (i.e., $\mu(dy|x)$ for $x$ near $t$). The large deviation lower bound requires a communication condition and a condition of locally uniform ergodicity on $\gamma(\cdot|v, t)$ when thought of as a transition kernel for each fixed $t$. These conditions are stated in Chapter 4.

Remark 2.1. This thesis is concerned with the efficiency of the IPS system, and for this problem the function of interest is $\langle F^N, L^m \rangle$.
However, the proof of Theorem 2.2 goes through without change if $\langle F^N, L^m \rangle$ is replaced by any bounded and continuous function of $L^m$. One thereby obtains the full large deviations principle for $\{L^m\}$ with the rate function
\[ I_N(\nu) = \inf\left\{ \int_0^1 G^N(t, \phi(t))\,dt : \nu = \int_0^1 \phi(s)\,ds \right\}. \]

Theorem 2.2. Assume Conditions 1.1, 1.2, and 1.4 and define $G^N : [0, 1] \times \mathcal{P}(\{0, \ldots, N\}) \to [0, \infty)$ by (4.22). Then
\[ \liminf_{m \to \infty} \hat{W}_N^m \ge \inf_{\nu \in \mathcal{P}(\{1, \ldots, N\})} \left[ \inf_{\phi :\, \nu = \int_{[0,1]} \phi(t)\,dt} \int_{[0,1]} G^N(t, \phi(t))\,dt + \left\langle F^N, \nu \right\rangle \right]. \tag{2.2} \]
If in addition Conditions 5.3 and 5.1 hold, then
\[ \limsup_{m \to \infty} \hat{W}_N^m \le \inf_{\nu \in \mathcal{P}(\{1, \ldots, N\})} \left[ \inf_{\phi :\, \nu = \int_{[0,1]} \phi(t)\,dt} \int_{[0,1]} G^N(t, \phi(t))\,dt + \left\langle F^N, \nu \right\rangle \right]. \tag{2.3} \]

Part II

Large Deviation Analysis

Chapter 3

Relative Entropy Representation

The key to the proof of Theorem 2.2 is a convenient relative entropy/control representation. The first step is to introduce notation which focuses attention on the key events of interest, i.e., the particles that make it to the next level and their distribution (via the "transportation" measure), and the resampling from among these particles. The next step is to identify a "local" ergodicity which is central to the form of the local rate function $G^N(t, \phi)$. This will require several steps of rewriting the representation to account for changes of variable, which are needed to rescale and properly center the "transportation" measure. Once this is done and a suitable representation is obtained, we can analyze the asymptotics of $\hat{W}_N^m$ using this representation and weak convergence methods. The derivation of the representation is the purpose of the present chapter.

3.1 The Representation Formula

We first define the single particle (underlying) transportation kernels at the prelimit level by
\[ \alpha_j^m(dy|x) = P\left\{ Z_{\sigma_j^m}^m \in dy \,\middle|\, Z_0^m = x \right\}, \tag{3.1} \]
where $x \in C_j^m \setminus C_{j+1}^m$, $\sigma_j^m = \inf\{i \ge 0 : Z_i^m \in C_{j+1}^m \cup (-\infty, -1/m]\}$, and where $\{Z_i^m\}$ is the underlying process model defined in (1.2).
Note that $\alpha_j^m$ is a stochastic kernel on $(-\infty, -1/m] \cup C_{j+1}^m$ given $C_j^m \setminus C_{j+1}^m$.

Next we define a rescaled version of the original process according to
\[ \breve{Z}_{i+1}^m = \breve{Z}_i^m + \theta_i(\breve{Z}_i^m/m). \tag{3.2} \]
It is easy to check that $\breve{Z}_i^m = mZ_i^m$ if $\breve{Z}_0^m = mZ_0^m$. The thresholds for the rescaled process are stretched versions of those used for the original process: $\breve{C}_j = mC_j^m = [j, \infty)$. With notation that follows that of the original IPS model, we denote the locations of particles before and after resampling at stage $j$ by $(\breve{Y}_{j,1}^m, \ldots, \breve{Y}_{j,N}^m)$ and $(\breve{X}_{j,1}^m, \ldots, \breve{X}_{j,N}^m)$. Then the related transportation kernel is defined by
\[ \breve{\alpha}_j^m(dy|x) = P\left\{ \breve{Z}_{\breve{\sigma}_j^m}^m \in dy \,\middle|\, \breve{Z}_0^m = x \right\} = P\left\{ mZ_{\sigma_j^m}^m \in dy \,\middle|\, mZ_0^m = x \right\} = \alpha_j^m(dy/m\,|\,x/m), \]
where $x \in \breve{C}_j \setminus \breve{C}_{j+1}$ and $\breve{\sigma}_j^m = \inf\{i \ge 0 : \breve{Z}_i^m \in \breve{C}_{j+1} \cup (-\infty, -1]\}$. In comparison with the original transportation kernel defined in (3.1), $\breve{\alpha}_j^m$ is now a stochastic kernel on $(-\infty, -1] \cup \breve{C}_{j+1}$ given $\breve{C}_j \setminus \breve{C}_{j+1}$. Note that both sets no longer depend on $m$. Later in this section a centering convention will be introduced, so that all the transportation kernels are on the same fixed space. For the rest of this chapter we will focus on the rescaled processes and related measures. This is without loss, in that all quantities of interest are the same for both systems. With an abuse of notation, we omit the "˘" from the notation.

To obtain a control representation, we exploit an underlying Markovian structure. Slightly complicating the description and notation is the fact that the evolution interweaves the transportation and resampling steps, and it will be useful in the representation to keep them distinct. At any given stage $j$, $(X_{j,1}^m, \ldots, X_{j,N}^m)$ takes values in $C_j^N \subset \mathbb{R}^N$. Given $(X_{j,1}^m, \ldots, X_{j,N}^m) = (x_1, \ldots, x_N)$, the transition kernel for the transportation stage is
\[ \Lambda_j^m(dy_1, \ldots, dy_N \,|\, x_1, \ldots, x_N) = \prod_{n=1}^N \left[ 1_{C_j \setminus C_{j+1}}(x_n)\,\alpha_j^m(dy_n|x_n) + 1_{C_{j+1}}(x_n)\,\delta_{x_n}(dy_n) \right]. \tag{3.3} \]
One can construct a measurable mapping $g_{j+1}^m = (g_{\mathrm{num},j+1}^m, g_{\mathrm{loc},j+1}^m)$ such that
\[ g_{\mathrm{num},j+1}^m(Y_{j+1,1}^m, \ldots, Y_{j+1,N}^m) = N_{j+1}^m, \]
\[ g_{\mathrm{loc},j+1}^m(Y_{j+1,1}^m, \ldots, Y_{j+1,N}^m) = (\tilde{Y}_{j+1,1}^m, \ldots, \tilde{Y}_{j+1,N_{j+1}^m}^m, 0, \ldots, 0), \]
where $\{\tilde{Y}_{j+1,i}^m\}$ are the locations of particles that made it up to the next level, and the trailing zeros are placeholders which keep the dimension of the vector equal to $N$. In fact, there are many such mappings, in that they may produce different orders for the $\tilde{Y}_{j+1,k}^m$, and the particular one used is not important. Now suppose that given $N_{j+1}^m = n$ and $\tilde{Y}_{j+1,k}^m = \tilde{y}_k$, $k \le n$, a random vector $(\sigma_1, \ldots, \sigma_N)$ is generated according to the product measure
\[ \Theta(\cdot|n) = \otimes_{i=1}^n \delta_i \otimes_{i=n+1}^N U_n, \tag{3.4} \]
where $U_n$ is the uniform measure on $\{1, \ldots, n\}$. The role of the $\otimes_{i=1}^n \delta_i$ is also as a placeholder, and simply says that the first $n$ particles have already been determined, and thus resampling is only used to determine the location of the remaining $N - n$ particles. Then we can set
\[ X_{j+1,k}^m = \begin{cases} \tilde{y}_k & k = 1, \ldots, n, \\ \tilde{y}_{\sigma_k} & k = n+1, \ldots, N. \end{cases} \]
There is a measurable map $h_j^m$ that takes $(g_{j+1}^m(Y_{j+1,1}^m, \ldots, Y_{j+1,N}^m), \sigma_1, \ldots, \sigma_N)$ into $(X_{j+1,1}^m, \ldots, X_{j+1,N}^m)$.

With these definitions, the IPS process can be built on the following probability space. Let $\mathcal{X} = \{(-\infty, -1] \cup [0, \infty)\}^N$ and $\mathcal{Y} = \{1, \ldots, N\}^N$, and set $\mathcal{Z} = [\mathcal{X} \times \mathcal{Y}]^m$. To simplify the notation, we denote corresponding vectors by bold symbols, e.g., $\mathbf{x}_0 = (x_{0,1}, \ldots, x_{0,N})$. We define a probability measure on $\mathcal{Z}$ by
\[ P^m(d\mathbf{y}_1, d\boldsymbol{\sigma}_1, \ldots, d\mathbf{y}_m, d\boldsymbol{\sigma}_m) = \Lambda_0^m(d\mathbf{y}_1|\mathbf{x}_0)\,\Theta(d\boldsymbol{\sigma}_1|g_{\mathrm{num},1}^m(\mathbf{y}_1))\,\Lambda_1^m(d\mathbf{y}_2|h_1^m(g_1^m(\mathbf{y}_1), \boldsymbol{\sigma}_1))\,\Theta(d\boldsymbol{\sigma}_2|g_{\mathrm{num},2}^m(\mathbf{y}_2)) \cdots. \]
Note that the distribution induced by $h_1^m(g_1^m(\mathbf{y}_1), \boldsymbol{\sigma}_1)$ when $\mathbf{y}_1$ and $\boldsymbol{\sigma}_1$ have distribution $\Lambda_0^m(d\mathbf{y}_1|\mathbf{x}_0)\,\Theta(d\boldsymbol{\sigma}_1|g_{\mathrm{num},1}^m(\mathbf{y}_1))$ is the same as that of $(X_{1,1}^m, \ldots, X_{1,N}^m)$ given $(X_{0,1}^m, \ldots, X_{0,N}^m) = (x_{0,1}, \ldots, x_{0,N})$.
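The resampling map $h$ with $\sigma \sim \Theta(\cdot|n)$ admits a direct implementation: the $n$ success particles are kept, and each failed particle is assigned the position of a uniformly chosen success particle. A sketch (the function name is ours, not from the thesis):

```python
import random

def resample(arrived, threshold, rng):
    """Resampling step of the IPS: keep the particles that reached [threshold, inf),
    and give each failed particle the position of a uniformly chosen success
    particle (the map h with sigma ~ Theta(.|n))."""
    survivors = [y for y in arrived if y >= threshold]   # the n success particles
    if not survivors:
        raise ValueError("extinction: no particle reached the next threshold")
    return survivors + [rng.choice(survivors)
                        for _ in range(len(arrived) - len(survivors))]
```

Note that, as in (3.4), the first $n$ coordinates of the output are deterministic given the arrivals; randomness enters only through the $N - n$ redrawn positions.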
We can then expand the expectation in (1.8) as
\[ \hat{W}_N^m = -\frac{1}{m} \log \int_{\mathcal{Z}} e^{2 \sum_{j=1}^m \log\left( g_{\mathrm{num},j}^m(\mathbf{y}_j)/N \right)}\, P^m(d\mathbf{y}_1, d\boldsymbol{\sigma}_1, \ldots, d\mathbf{y}_m, d\boldsymbol{\sigma}_m). \]
Applying Proposition 0.5, the relative entropy representation for exponential integrals, we have
\[ -\frac{1}{m} \log \int_{\mathcal{Z}} e^{2 \sum_{j=1}^m \log\left( g_{\mathrm{num},j}^m(\mathbf{y}_j)/N \right)}\, P^m(d\mathbf{y}_1, d\boldsymbol{\sigma}_1, \ldots, d\mathbf{y}_m, d\boldsymbol{\sigma}_m) = \inf_{Q^m} \left[ \frac{1}{m} R(Q^m \| P^m) - \frac{2}{m} \int_{\mathcal{Z}} \sum_{j=1}^m \log\left( \frac{g_{\mathrm{num},j}^m(\mathbf{y}_j)}{N} \right) Q^m(d\mathbf{y}_1, d\boldsymbol{\sigma}_1, \ldots, d\mathbf{y}_m, d\boldsymbol{\sigma}_m) \right]. \]
We now factor $Q^m(d\mathbf{y}_1, d\boldsymbol{\sigma}_1, \ldots, d\mathbf{y}_m, d\boldsymbol{\sigma}_m)$ to reflect the way that $P^m$ is factored. While $P^m$ exhibits a type of Markovian structure, the same cannot be assumed of $Q^m$. However, we can always factor it so that the resulting conditional distributions do not anticipate the future. Thus we write
\[ Q^m(d\mathbf{y}_1, d\boldsymbol{\sigma}_1, \ldots, d\mathbf{y}_m, d\boldsymbol{\sigma}_m) = \bar{\Lambda}_0^m(d\mathbf{y}_1)\,\bar{\Theta}_1^m(d\boldsymbol{\sigma}_1)\,\bar{\Lambda}_1^m(d\mathbf{y}_2) \cdots \bar{\Lambda}_{m-1}^m(d\mathbf{y}_m)\,\bar{\Theta}_m^m(d\boldsymbol{\sigma}_m), \]
where $\bar{\Theta}_j^m(d\boldsymbol{\sigma}_j) \in \mathcal{P}(\mathcal{Y})$ is implicitly a function of $(\mathbf{y}_1, \boldsymbol{\sigma}_1, \ldots, \mathbf{y}_j)$ and $\bar{\Lambda}_{j-1}^m(d\mathbf{y}_j) \in \mathcal{P}((-\infty, -1] \cup C_j)$ is implicitly a function of $(\mathbf{y}_1, \boldsymbol{\sigma}_1, \ldots, \mathbf{y}_{j-1}, \boldsymbol{\sigma}_{j-1})$. Denote the distribution of $(\bar{\mathbf{Y}}_1^m, \bar{\boldsymbol{\sigma}}_1^m, \ldots, \bar{\mathbf{Y}}_m^m, \bar{\boldsymbol{\sigma}}_m^m)$ by $Q^m$. Then using the chain rule for relative entropy, Theorem 0.8, to decompose $R(Q^m \| P^m)$, we can write the variational representation as
\[ \inf_{\{(\bar{\Lambda}_j^m, \bar{\Theta}_{j+1}^m)\}} \bar{E}\left[ \frac{1}{m} \sum_{j=0}^{m-1} R\left( \bar{\Lambda}_j^m(\cdot) \,\middle\|\, \Lambda_j^m(\cdot\,|\,h_j^m(g_j^m(\bar{\mathbf{Y}}_j^m), \bar{\boldsymbol{\sigma}}_j^m)) \right) + \frac{1}{m} \sum_{j=1}^m R\left( \bar{\Theta}_j^m(\cdot) \,\middle\|\, \Theta(\cdot\,|\,g_{\mathrm{num},j}^m(\bar{\mathbf{Y}}_j^m)) \right) - \frac{2}{m} \sum_{j=1}^m \log\left( \frac{g_{\mathrm{num},j}^m(\bar{\mathbf{Y}}_j^m)}{N} \right) \right], \tag{3.5} \]
where $\bar{E}$ denotes expected value with respect to $Q^m$.

To proceed with the analysis, we need to introduce a centering in the previous definitions. The usefulness of the centering is due to the fact that the environment presented to each of the transportation and resampling measures is in some sense stationary relative to the starting location $C_j$ of the particles at stage $j$. Our goal is to analyze what amounts to an empirical measure on the number and locations of particles that make it to successive levels.
For this to work out, the various measures should be defined on a fixed space that does not depend on $j$. In fact, it is only the difference of particle positions before and after transportation that we need to record. By a change of variable we can redefine the transportation kernels $\alpha_j^m$ so that rather than being on $(-\infty, -1] \cup C_{j+1}$ given a point in $C_j \setminus C_{j+1}$, we consider them as measures on the fixed space $\Delta \cup [0, \infty)$ given $[0, 1)$, where $\Delta$ will be called the "cemetery" state and plays the role of $(-\infty, -1]$, while $[0, \infty)$ plays the role of $C_{j+1}$. We make the analogous change with respect to the $N$-particle transportation kernels $\Lambda_j^m$ defined in (3.3). Under this convention, the mappings $g_j^m = (g_{\mathrm{num},j}^m, g_{\mathrm{loc},j}^m)$ and $h_j^m$ now have domains and ranges
\[ g : \{\Delta \cup [0, \infty)\}^N \to \{1, \ldots, N\} \times [0, \infty)^N, \]
\[ h : \{1, \ldots, N\} \times [0, \infty)^N \times \mathcal{Y} \to [0, \infty)^N. \]
Thus $g = (g_{\mathrm{num}}, g_{\mathrm{loc}})$, where $g_{\mathrm{num}}$ counts the number of particles that make it up to $[0, \infty)$ (termed success particles), and $g_{\mathrm{loc}}$ returns the locations of these particles. Again we will abuse notation and use the same labels as before for these new centered measures. In terms of the original random variables, we can exhibit the centered and rescaled transportation kernel, for $x \in [0, 1)$, as
\[ \alpha_j^m(A|x) = \begin{cases} P\left\{ mZ_{\sigma_j^m}^m - (j+1) \in A \,\middle|\, mZ_0^m - j = x \right\} & \text{if } A \subset [0, \infty), \\ P\left\{ mZ_{\sigma_j^m}^m \in (-\infty, -1] \,\middle|\, mZ_0^m - j = x \right\} & \text{if } A = \{\Delta\}. \end{cases} \tag{3.6} \]
The transportation kernel is then
\[ \Lambda_j^m(dy_1, \ldots, dy_N \,|\, x_1, \ldots, x_N) = \prod_{n=1}^N \left[ 1_{[0,1)}(x_n)\,\alpha_j^m(dy_n|x_n) + 1_{[1,\infty)}(x_n)\,\delta_{x_n-1}(dy_n) \right]. \tag{3.7} \]
Note that all the transportation kernels are measures on the same space, i.e., $\{\Delta \cup [0, \infty)\}^N$.
We thereby obtain the following variational representation, which is now amenable to weak convergence analysis:
\[ V_N^m = \inf_{\{(\bar{\Lambda}_j^m, \bar{\Theta}_{j+1}^m)\}} \bar{E}\left[ \frac{1}{m} \sum_{j=0}^{m-1} R\left( \bar{\Lambda}_j^m(\cdot) \,\middle\|\, \Lambda_j^m(\cdot\,|\,h(g(\bar{\mathbf{Y}}_j^m), \bar{\boldsymbol{\sigma}}_j^m)) \right) + \frac{1}{m} \sum_{j=1}^m R\left( \bar{\Theta}_j^m(\cdot) \,\middle\|\, \Theta(\cdot\,|\,g_{\mathrm{num}}(\bar{\mathbf{Y}}_j^m)) \right) - \frac{2}{m} \sum_{j=1}^m \log\left( \frac{g_{\mathrm{num}}(\bar{\mathbf{Y}}_j^m)}{N} \right) \right], \tag{3.8} \]
where $V_N^m$ is referred to as the minimal cost function.

Without loss we can restrict to controls for which the cost is finite. This implies the finiteness of the relative entropies, which in turn implies that the resampling controls $\bar{\Theta}_j^m$ should have the same support as $\Theta(\cdot|g_{\mathrm{num}}(\bar{\mathbf{Y}}_j^m))$, w.p.1. Recalling that $\Theta(\cdot|n) = \otimes_{i=1}^n \delta_i \otimes_{i=n+1}^N U_n$, this means that $\bar{\Theta}_j^m(\cdot|\omega)$ will inherit the property that the resampling control leaves the $\bar{N}_j^m = g_{\mathrm{num}}(\bar{\mathbf{Y}}_j^m)$ particles that made it up alone, and samples from among these particles (though perhaps with a distribution not equal to the uniform distribution on $\{1, \ldots, \bar{N}_j^m\}$) to determine the locations of the remaining $N - g_{\mathrm{num}}(\bar{\mathbf{Y}}_j^m)$ particles.

We formally state the representation so derived.

Theorem 3.1 (Representation Theorem). Let $\Theta$, $\Lambda_j^m$ and $V_N^m$ be defined by (3.4), (3.7) and (3.8), respectively. Then for all $m \in \mathbb{N}$
\[ \hat{W}_N^m = -\frac{1}{m} \log E \prod_{j=1}^m \left( \frac{N_j^m}{N} \right)^2 \]
is equal to the minimal cost function $V_N^m$.

3.2 The Limiting Behavior of the Underlying Transportation Kernels

To evaluate the asymptotics of $\hat{W}_N^m$ and identify the rate function, we will need to identify the limit of $\Lambda_j^m$. The rescaled and centered transportation kernels $\alpha_j^m(dy|x)$ are related to the asymptotic first entrance distributions defined in (1.9). The following lemma makes the relation precise.

Lemma 3.2. Assume Conditions 1.1 and 1.4. Define $\alpha_j^m(dy|x)$ by (3.6) and $\gamma(dy|x, t)$ by (1.9), and assume that $j/m \to t \in (0, 1)$ as $m \to \infty$. Then for $x \in [0, 1)$
\[ \alpha_j^m(\cdot|x) \Rightarrow \gamma(\cdot|x, t), \]
and thus for $x \in [0, \infty)$
\[ 1_{[0,1)}(x)\,\alpha_j^m(\cdot|x) + 1_{[1,\infty)}(x)\,\delta_{x-1}(\cdot) \Rightarrow \gamma(\cdot|x, t). \tag{3.9} \]
Moreover, for each $x \in [0, 1)$, $t \to \gamma(\cdot|x, t)$ is continuous for $t \in [0, 1]$.

Proof. We recall that $\alpha_j^m(\cdot|x)$ is defined in (3.6) by setting
\[ Z_{i+1} = Z_i + \theta_i(Z_i/m), \quad Z_0 = j + x, \tag{3.10} \]
and then stopping this process at the first time $\sigma_j^m$ when either $Z_i \ge j+1$ or $Z_i \le -1$, with the latter event meaning that mass is assigned to $\Delta$, and $\alpha_j^m(A|x) = P\{Z_{\sigma_j^m} \in j+1+A\}$ when $A \subset [0, \infty)$. Let $b \in (0, 1)$, and define
\[ \sigma_j^m(b) = \inf\{i \ge 0 : Z_i \ge j+1 \text{ or } Z_i \le jb\}. \]
The parameter $b$ is used to localize the problem, so that we only have to pay attention to the process when $Z_i/m$ is very close to $t$. Under the stability condition (Condition 1.1), given any $a > 0$, for all sufficiently large $m$ we have
\[ P\{Z_{\sigma_j^m} \ge j+1 \text{ and } Z_{\sigma_j^m(b)} \le jb\} \le a. \]
It follows that if we define "barred" measures by $\bar{\alpha}_j^m(A|x) = P\{Z_{\sigma_j^m(b)} \in j+1+A\}$ and $\bar{\alpha}_j^m(\Delta|x) = P\{Z_{\sigma_j^m(b)} < j+1\}$, then modulo an error in the total variation norm of size $2a$, $\alpha_j^m(\cdot|x)$ and $\bar{\alpha}_j^m(\cdot|x)$ have the same weak limit. We can also define $\bar{\gamma}(\cdot|x, t)$ in the analogous way, and obtain the same total variation norm error bound. Therefore, it suffices to show that $\bar{\alpha}_j^m(\cdot|x) \Rightarrow \bar{\gamma}(\cdot|x, t)$.

Again using Condition 1.1, $\bar{\alpha}_j^m(A|x)$ can be characterized, by conditioning on the first step, as the unique solution to the integral equation
\[ \bar{\alpha}_j^m(A|x) = \int_{\{y :\, jb < j+x+y < j+1\}} \bar{\alpha}_j^m(A|x+y)\,\mu\left(dy \,\middle|\, \frac{j+x}{m}\right) + \mu\left(A + 1 - x \,\middle|\, \frac{j+x}{m}\right). \]
Iterating this equation leads to a series representation, (3.13), whose $n$th term integrates over the set $D_n^m$ of increment sequences for which the process has not escaped by step $n$. For any $\varepsilon > 0$ there exists $M < \infty$ such that the probability that the random walk escapes from $[-1, j+1]$ in $M$ steps is greater than $\varepsilon$, uniformly in $x$, and thus
\[ P\{\sigma_j^m(b) \le M \,|\, Z_0 = x\} \ge P\{\sigma_j^m \le M \,|\, Z_0 = x\} > \varepsilon. \]
On the other hand, the first term in (3.13) can be bounded above in terms of the probability of not escaping in $n$ steps. In fact, for $n = kM + l$, $k \ge 1$, $l < M$,
\[ \int_{D_n^m} \bar{\alpha}_j^m\left(A \,\middle|\, \sum_{i=0}^n x_i\right) \mu(dx_n|t) \cdots \mu(dx_1|t) = P\{\sigma_j^m(b) > n,\ Z_n \in j+1+A \,|\, x\} \le (1-\varepsilon)^k = \left\{ [1-\varepsilon]^{1/(M+1)} \right\}^{(M+1)k} \le \rho^n, \]
where $\rho = \rho(\varepsilon, M) = [1-\varepsilon]^{1/(M+1)} < 1$. Repeating the analysis for $\bar{\gamma}$, we get
\[ \int_{D_n^m} \bar{\gamma}\left(A \,\middle|\, \sum_{i=0}^n x_i\right) \mu(dx_n|t) \cdots \mu(dx_1|t) \le \rho^n. \]
Pick $b < 1$ such that $c_b$ satisfies $\rho/c_b < 1$. Then the series in (3.13) is convergent, and for $n$ large enough
\[ \bar{\alpha}_j^m(A|x) \le \left( \frac{\rho}{c_b} \right)^n + \sum_{k=0}^{n-1} \left( \frac{1}{c_b} \right)^{k+1} \int_{D_k^m} \mu\left( A + 1 - \sum_{i=0}^k x_i \,\middle|\, t \right) \mu(dx_k|t) \cdots \mu(dx_1|t) \le \left( \frac{\rho}{c_b} \right)^n + \left( \frac{1}{c_b} \right)^n \sum_{k=0}^{n-1} \int_{D_k^m} \mu\left( A + 1 - \sum_{i=0}^k x_i \,\middle|\, t \right) \mu(dx_k|t) \cdots \mu(dx_1|t) \le \left( \frac{\rho}{c_b} \right)^n + \left( \frac{1}{c_b} \right)^n \bar{\gamma}(A|x, t). \]
With a similar computation we obtain the lower bound
\[ \bar{\alpha}_j^m(A|x) \ge c_b^n \int_{D_n^m} \bar{\alpha}_j^m\left(A \,\middle|\, \sum_{i=0}^n x_i\right) \mu(dx_n|t) \cdots \mu(dx_1|t) + \sum_{k=0}^{n-1} c_b^{k+1} \int_{D_k^m} \mu\left( A + 1 - \sum_{i=0}^k x_i \,\middle|\, t \right) \mu(dx_k|t) \cdots \mu(dx_1|t) \ge c_b^n \left( \bar{\gamma}(A|x, t) - \rho^n \right). \]
Combining these two inequalities gives
\[ c_b^n \left( \bar{\gamma}(A|x, t) - \rho^n \right) \le \bar{\alpha}_j^m(A|x) \le (1/c_b)^n \left( \bar{\gamma}(A|x, t) + \rho^n \right). \]
The claim now follows by sending $m \to \infty$ (and using $j/m \to t$), then $b \uparrow 1$, and finally $n \to \infty$. The last sentence of the lemma follows from a very similar construction, and its proof is hence omitted.

Chapter 4

The Large Deviation Upper Bound

In this chapter we prove (2.2), which corresponds to a large deviation upper bound. We first rewrite the control representation (3.8) so that the controls and controlled process are in a space that is independent of $m$, which facilitates the use of weak convergence arguments. Tightness is addressed at the beginning of Section 4.2, and then weak convergence methods are used to relate the limits of the controls to the limits of the processes. It is this asymptotic relationship which identifies the local rate function, and in Section 4.3 these results are used to prove a large deviation upper bound with the given rate.

4.1 Notation and Preliminaries

In this section we make a few final adjustments to the relative entropy representation. A number of stochastic kernels indexed by $t \in [0, 1]$ will be needed.
To simplify the notation, the definition at $t = 1$ is usually omitted. This is without loss, since these kernels are always integrated with respect to Lebesgue measure.

Recall the control representation (3.8). Let $\bar{\mathbf{X}}_j^m = h(g(\bar{\mathbf{Y}}_j^m), \bar{\boldsymbol{\sigma}}_j^m)$ and $\bar{N}_j^m = g_{\mathrm{num}}(\bar{\mathbf{Y}}_j^m)$. These quantities have exactly the same interpretation as for the original process, but with respect to the controlled distributions. Thus $\bar{N}_j^m$ is the number of particles making it to threshold $j$, and $\bar{\mathbf{X}}_j^m$ are the locations after resampling. We will also want to keep track of the dependence of the transportation control as a function of the state of the process $\bar{\mathbf{X}}_j^m$. For $m \in \mathbb{N}$ and $t \in [0, 1]$ define the stochastic kernel
\[ \bar{\xi}^m(dx, dy|t) \doteq \delta_{\bar{\mathbf{X}}_j^m}(dx) \times \bar{\Lambda}_j^m(dy) \quad \text{if } t \in \left[ \frac{j}{m}, \frac{j+1}{m} \right),\ j = 0, \ldots, m-1, \]
where $\bar{\Lambda}_j^m(dy) = \bar{\Lambda}_j^m(dy|\bar{\mathbf{Y}}_1^m, \bar{\boldsymbol{\sigma}}_1^m, \ldots, \bar{\mathbf{Y}}_j^m, \bar{\boldsymbol{\sigma}}_j^m)$. Also define
\[ \bar{\eta}^m(dn, d\sigma|t) \doteq \delta_{\bar{N}_{j+1}^m}(dn) \times \bar{\Theta}_{j+1}^m(d\sigma) \quad \text{if } t \in \left[ \frac{j}{m}, \frac{j+1}{m} \right),\ j = 0, \ldots, m-1, \]
where $\bar{\Theta}_j^m(d\sigma) = \bar{\Theta}_j^m(d\sigma|\bar{\mathbf{Y}}_1^m, \bar{\boldsymbol{\sigma}}_1^m, \ldots, \bar{\mathbf{Y}}_j^m)$. Random probability measures $\bar{\xi}^m \in \mathcal{P}([0, \infty)^N \times \{\Delta \cup [0, \infty)\}^N \times [0, 1])$ and $\bar{\eta}^m \in \mathcal{P}(\{1, \ldots, N\} \times \mathcal{Y} \times [0, 1])$ are then defined by
\[ \bar{\xi}^m(\mathbf{A}_1 \times \mathbf{A}_2 \times B) \doteq \int_B \bar{\xi}^m(\mathbf{A}_1 \times \mathbf{A}_2|t)\,dt, \tag{4.1} \]
\[ \bar{\eta}^m(C_1 \times \mathbf{C}_2 \times B) \doteq \int_B \bar{\eta}^m(C_1 \times \mathbf{C}_2|t)\,dt \tag{4.2} \]
for Borel subsets $\mathbf{A}_1$ of $[0, \infty)^N$, $\mathbf{A}_2$ of $\{\Delta \cup [0, \infty)\}^N$, $C_1$ of $\{1, \ldots, N\}$, $\mathbf{C}_2$ of $\mathcal{Y}$, and $B$ of $[0, 1]$. Again, we use bold uppercase letters to denote subsets of the higher dimensional spaces. Similarly, measures related to the distributions used to define the original process are defined by
\[ r^m(dx, dy|t) \doteq \delta_{\bar{\mathbf{X}}_j^m}(dx) \times \Lambda_j^m(dy|\bar{\mathbf{X}}_j^m) \quad \text{if } t \in \left[ \frac{j}{m}, \frac{j+1}{m} \right),\ j = 0, \ldots, m-1, \]
\[ q^m(dn, d\sigma|t) \doteq \delta_{\bar{N}_{j+1}^m}(dn) \times \Theta(d\sigma|\bar{N}_{j+1}^m) \quad \text{if } t \in \left[ \frac{j}{m}, \frac{j+1}{m} \right),\ j = 0, \ldots, m-1, \]
and
\[ r^m(\mathbf{A}_1 \times \mathbf{A}_2 \times B) \doteq \int_B r^m(\mathbf{A}_1 \times \mathbf{A}_2|t)\,dt, \tag{4.3} \]
\[ q^m(C_1 \times \mathbf{C}_2 \times B) \doteq \int_B q^m(C_1 \times \mathbf{C}_2|t)\,dt, \tag{4.4} \]
where $\mathbf{A}_i$, $C_i$, $B$ are as above.
The chain rule for relative entropy Theorem 0.8 is then used to rewrite (3.8) in terms of these measures. To be precise, we repeatedly use the fact that for probability measures and stochastic kernels defined on the appropriate spaces Z R (µ(dz|u)γ(du) kν(dz|u)γ(du)) = R (µ(·|u) kν(·|u)) γ(du) U m to insert the Dirac functions appearing in the definitions of ¯ξ , η¯m , rm and q m without any change in the cost. For ` ∈ P ({1, . . . , N }) recall the definition N X N n F ,` = −2 log ` (n) . n=0 N With the 1/m factors accounted for by the piecewise constant nature of the measures and the Lebesgue integral, (3.8) can be restated as Z  1  m   N m ˆm= W inf ¯ E ¯ m R ξ (· |t ) kr (· |t) + R (¯ m m ¯ η (· |t ) kq (· |t ))) dt + F , L , N {¯ξm ,¯ηm } 0 (4.5) where X m ¯m = 1 L δ ¯m (4.6) m j=1 Nj is the controlled empirical measure. The proof of Theorem 2.2 uses weak convergence arguments to analyze the limit of (4.5). Before proving the tightness results for our problem, it is useful to recall some criteria for the tightness of probability measures. In the lemmas to follow, V 45 is a Polish space and A is an arbitrary index set. A tightness function on V is a mapping g : V → [0, ∞] such that for each M ∈ [0, ∞), {v : g(v) ≤ M } is compact. The first result is given in [9, Theorem A.3.17]. Lemma 4.1. Let g be a tightness function on V. Define a function G (θ) mapping P (V) into R ∪ {∞} by Z G (θ) = gdθ. V Then the following conclusions hold. (a) For each M < ∞, the set ZG (M ) = {θ ∈ P (V) : G (θ) ≤ M } is tight. (b) G is a tightness function on P (V). The next result can be found in [20, Theorem 6.1]. Lemma 4.2. A family {ν a : (Ω, F, P) → P (V) , a ∈ A} of random probability measures on a Polish space V is tight if and only if the family of (deterministic) probability measures {Eν a , a ∈ A} ⊂ P (V) is tight. The next lemma gives an elementary sufficient condition for the tightness of empirical measures.  Lemma 4.3. Let Xj ∈ RN , j = 0, . . . 
m − 1 be a collection of random vectors. P Define the empirical measure J m = m1 m−1 j=0 δ Xj . Suppose that given ε > 0 there exists b < ∞ such that for all n ∈ {1, . . . , N } 1 X m−1 P (|Xj,n | > b) < ε. m j=0 Then {J m } is tight. Proof. By Lemma 4.2, it suffices to show that given δ > 0, there is a compact set K such that EJ m (Kc ) < δ for all m. Given δ, let ε = δ/N and then choose b so that 46 the previous display holds. Letting K = [−b, b]N , we have 1 X 1 XX m−1 m−1 N m c c EJ (K ) = P (Xj ∈ K ) ≤ P (|Xj,n | > b) < N ε = δ. m j=0 m j=0 n=1 4.2 Tightness and Characterization of Limits We now proceed to prove the tightness of both the underlying measure and the ad- missible control measures. Note that, owing to the rescaling and centering, measures such as αm j (dy |x ) are stochastic kernels on ∆ ∪ [0, ∞). Proposition 4.4. Assume Conditions 1.1 and 1.2. For each m ∈ N, consider any ¯m admissible control sequence {Λ ¯m j , j = 0, . . . , m − 1} and {Θj , j = 1, . . . , m} such that ( ) 1 X m−1   ¯ sup E R Λ¯m m ¯ m ¯m j (·) Λj (· h g(Yj ), σ j ) < ∞ m∈N m j=0 ( ) 1 Xm   sup E¯ R Θ ¯ m (·) Θ(· gnum Y ¯ jm ) < ∞, j m∈N m j=1 where Λ ¯ m (·|Y ¯ m (·) = Λ ¯ 1, σ ¯ j, σ ¯ 1, . . . , Y ¯ j ) and Θ ¯ m (·|Y ¯ m (·) = Θ ¯ 1, σ ¯ j ). Then ¯ 1, . . . , Y j j j j the following results hold. 1. {rm , m ∈ N} with rm defined in (4.3) is tight. m 2. Let the admissible control measures (¯ξ , η¯m ) be defined by equations (4.1) and m (4.2). Then {(¯ξ , η¯m ), m ∈ N} is tight. Proof. Part 1 (tightness of rjm ). We divide the proof for the tightness of rm into three steps. Step 1: the tightness of αm . Let H (x, α) be the moment generating function defined . in (1.4). By Condition 1.2, H (α) = supx∈[−1,1] H (x, α) < ∞. Given j = 0, . . . , m−1, 47 m m let {ˆθj,i (x) , i ∈ N, x ∈ [−1, 1]} be iid random fields with ˆθj,i (x) ∼ µ (· |x + j/m ) m and {Zj,i } be the Markov chain defined by  m Zj,i+1 m = Zj,i + ˆθj,i Zj,i m /m , m Zj,0 = z < 0. 
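The exponential supermartingale built from the moment generating function drives the tail bound in Step 1. As a sanity check, here is a Monte Carlo verification of a Chernoff-type bound for a toy stand-in (i.i.d. Gaussian steps with negative mean replace the state-dependent increments; this choice is purely for illustration). With steps $\mathcal{N}(-1/2,1)$ one has $H(\alpha) = -\alpha/2 + \alpha^2/2$, so $\alpha^* = 1$ gives $H(\alpha^*) = 0$, $e^{\alpha^* Z_i}$ is a martingale, and the maximal inequality yields $P(\sup_i Z_i \ge b) \le e^{-\alpha^* b}$ for $Z_0 = 0$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the walk Z: i.i.d. Gaussian steps with negative mean.
# H(alpha) = log E[exp(alpha * step)] = -alpha/2 + alpha^2/2, and
# alpha* = 1 solves H(alpha*) = 0, so exp(alpha* Z_i) is a martingale and
# P(sup_i Z_i >= b) <= exp(-alpha* b) when Z_0 = 0.
alpha_star, b = 1.0, 2.0
paths, steps = 20_000, 300

increments = rng.normal(-0.5, 1.0, size=(paths, steps))
sup_z = np.cumsum(increments, axis=1).max(axis=1)   # running maximum of each path
estimate = float(np.mean(sup_z >= b))
bound = float(np.exp(-alpha_star * b))
print(f"P(sup Z >= {b}) ~ {estimate:.4f}, bound e^(-alpha*.b) = {bound:.4f}")
```

The simulated exceedance frequency sits strictly below the exponential bound, which is the same mechanism that produces (4.7) for the state-dependent chain.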
m m Then for each fixed α, the sequence Tj,i = exp{αZj,i − iH (α)} is a supermartingale m m with respect to the filtration Fj,i = σ{Zj,s : s ≤ i}. Indeed,  m m h n m  o i m ˆ m m E Tj,i+1 Fj,i = Tj,i E exp αθj,i Zj,i /m − H (α) Fj,i m  m  = Tj,i exp H Zj,i /m + j/m, α − H (α) m ≤ Tj,i . Define the stopping time τ m m j = inf{k > 0 : Zj,k ∈ / (−j − 1, 0)}. Then by the Optional Sampling Theorem, for α > 0 n o m m E exp αZj,τ m − τ j H (α) j ≤ eαz < 1. If we can prove that H (α∗ ) < 0 for some α∗ > 0, then Chebyshev’s inequality yields n o n o m −α∗ b ∗ m P Zj,τ m ≥ b ≤ e E exp α Z m j,τ j j ∗ n o ≤ e−α b E exp α∗ Zj,τ m m ∗ m − τ j H (α ) j ∗ ≤ e−α b , m which implies the tightness of {(Zj,τ m ) ∨ 0, j = 1, . . . , m, m ∈ N}. For later use, we j m note that Zj,i − z has the same distribution as the process Zi − (j + x) (equation (3.10)) that appears in the definition of αm j (·|x) when z = x − 1, and so we have −α b∗ αm j ([b, ∞)|x) ≤ e . (4.7) 48 We are now left to show the existence of such α∗ . According to Condition 1.1, Dα H (x, 0) = Eˆθ0,i (x) < 0 for all x ∈ [−1, 1]. Hence for each x there is αx > 0 such that H (x, α) < 0 for all α ∈ (0, αx ), and by the continuity of H we can also take αx to be continuous in x. Since [−1, 1] is compact, inf αx > 0, hence such α∗ exists. m Step 2: tightness of ¯ξ 1 . By assumption, there exists M < ∞ such that ( ) 1 X m−1    sup ER ¯m ¯ ξ kr m ¯ = sup E ¯ m (·) Λm (· h g(Y R Λ ¯ jm ), σ ¯m ≤ M. j j j ) m∈N m∈N m j=0 (4.8) According to the variational formula in Proposition 0.6 or Lemma 0.7(a), for any measurable function ψ bounded from below and for each m ∈ N, j ≤ m Z Z   ¯ E ¯m ψdΛ − log eψ dΛm ¯ Λ ≤ ER ¯ m Λm . (4.9) j j j j Define the measurable function   c yn > b ψ (y) = .  0 otherwise ¯ jm and (4.9) yield Then applying the inequality (4.7) to each component of Y Z   ¯ cP Y¯j,n m ≥b ¯ = E ¯m ψdΛ j    ¯ Λ ≤ ER ¯ m Λm + log 1 + ec e−α∗ b . j j ¯ m ∈ K is possible only if For n ∈ {1, . . . 
, N } and any Borel set K of [0, ∞), X j,n Y¯j,n m 0 0 ∈ K for some n ∈ {1, . . . , N }. Therefore, given any ε > 0 there are b < ∞ and c < ∞ such that 1 X ¯  ¯m 1 X X ¯ ¯m m−1 m−1 N NM N  ∗  P Xj,n ≥ b ≤ P Yj,n ≥ b ≤ + log 1 + ec−α b ≤ ε. m j=0 m j=0 n=1 c c (4.10) 49 m m With ¯ξ 1 playing the role of J m , Lemma 4.3 then yields the tightness of {¯ξ 1 }. 1 Pm−1 Step 3: tightness of rm . If we can show that the measures m j=0 Λm j also tight, then tightness of {rm } will follow. From the definition of Λm j (·|x) in (3.7) we have the upper bound X N Λm j (A1 × · · · × AN |x) ≤ [1[0,1) (xn )αm j (An |xn ) + 1[1,∞) (xn )δ xn −1 (An )]. n=1 −α b ∗ Recall that we have shown αm j ([b, ∞)|x) ≤ e . Given ε > 0, choose b, c < ∞ ∗ satisfying both e−α b < ε and (4.10) and let K = [b, ∞)N . Then ( ) 1 X m−1  ¯ E ¯m Λm Kc |X m j=0 j j ( m−1 N ) 1 XXh    i ≤ E¯ ¯ m αm [b, ∞) X 1[0,1) X ¯ m + 1[1,∞) X ¯ m δ X¯ m −1 ([b, ∞)) j,n j j,n j,n m j=0 n=1 j,n 1 X X ¯  ¯m m−1 N < ε+ P Xj,n ≥ b + 1 m j=0 n=1 < (N + 1)ε. Applying Lemma 4.2, we complete the proof of the tightness of {rm }. m Part 2 (tightness of {(¯ξ , η m j )}). Tightness of r m implies that for each i ∈ N ¯ m (Ki ) ≥ 1 − 2−i for all m. there exists a compact set Ki of X × Y such that Er Suppose we define a by ∞ X a (·) = 1 + i1Kci (·) . i=1 Then Z ∞ X ¯ sup E adrm ≤ 1 + i2−i = M1 < ∞. m∈N i=1 Letting U (·) = log a (·), it follows that U ≥ 0 and U has compact level sets. Thus 50 U is a tightness function such that Z ¯ sup E eU drm ≤ M1 . m∈N Then, again applying the variational formula as in (4.9), Jensen’s inequality, and the uniform bound (4.8), Z   Z  m ¯ sup E U d¯ξ ¯ ≤ M + sup E log e drU m m∈N m∈N Z  ¯ ≤ M + log sup E U e dr m m∈N < ∞. m Lemma 4.1 then implies the tightness of {¯ξ }. Recall that Y = {1, . . . , N }N . Since the measures η¯m take values in the compact η m } is automatic. set P({1, . . . , N } × Y × [0, 1), tightness of {¯ Next we state the main weak convergence result. 
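The inequality (4.9) used repeatedly above is an instance of the Donsker–Varadhan variational formula, $\int \psi\, d\mu - \log \int e^{\psi}\, d\nu \le R(\mu\|\nu)$ for bounded measurable $\psi$. A discrete numerical check (the distributions below are toy examples, not from the thesis):

```python
import numpy as np

def rel_entropy(mu, nu):
    """R(mu || nu) for discrete distributions, in nats."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))

def dv_lower_bound(mu, nu, psi):
    """The left side of (4.9): int psi dmu - log int exp(psi) dnu."""
    mu, nu, psi = (np.asarray(a, float) for a in (mu, nu, psi))
    return float(np.dot(mu, psi) - np.log(np.dot(nu, np.exp(psi))))

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
R = rel_entropy(mu, nu)

rng = np.random.default_rng(2)
for _ in range(5):
    psi = rng.uniform(-2.0, 2.0, size=3)     # arbitrary bounded test function
    assert dv_lower_bound(mu, nu, psi) <= R + 1e-12

# Equality is attained at psi = log(dmu/dnu):
print(np.isclose(dv_lower_bound(mu, nu, np.log(mu / nu)), R))
```

The indicator-type $\psi$ chosen in the proof (value $c$ on a bad set, $0$ elsewhere) is exactly such a bounded test function, which is how the relative entropy bound is converted into a probability bound.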
Recall that for a probability measure such as ¯ξ on a product space, ¯ξ i will indicate the ith marginal. The first two statements are straightforward, and serve to establish notation and decompositions of the limit measures. The last two statements identify key equilibrium properties of the joint distribution of the location and number of the particles that make it to the next threshold. Specifically, equation (4.16) identifies a kind of local (in time or stage) stationary distribution on the particle locations after resampling and in the limit for the controlled IPS system, while (4.17) gives the corresponding distribution on the number of particles that make it to the next level. These relations, which link the equilibrium distributions to the limits of the controls, are the main ingredient in the definition of the local rate function for the IPS process. m  Lemma 4.5. Assume Conditions 1.1 and 1.4. Let ¯ξ , η¯m be the admissible control measures defined by equations (4.1) and (4.2) and let rm and q m be defined by (4.3) 51 ¯ m be the controlled empirical measure defined in (4.6), and and (4.4). Also, let L assume that Z  . 1  m   ¯ M = sup E R ¯ξ (· |t ) kr (· |t ) + R (¯ m m m η (· |t ) kq (· |t))) dt < ∞. (4.11) m∈N 0 m  Then every subsequence of ¯ξ , η¯m , rm , q m , L ¯ m has a subsubsequence that converges  in distribution to a limit ¯ξ, η¯, r, q, L ¯ with the following properties.   ¯ F, 1. There exists a probability space Ω, ¯ such that the limiting quantities ¯ξ, η¯ ¯ P can be realized, respectively, as stochastic kernels on [0, ∞)N ×{∆ ∪ [0, ∞)}N × ¯ Furthermore there are stochastic [0, 1] and {1, . . . , N } × Y × [0, 1] given Ω. kernels ¯ξ (dx × dy |t, ω ) and η¯ (dn × dσ |t, ω ) on [0, ∞)N × {∆ ∪ [0, ∞)}N and ¯ such that {1, . . . , N } × Y × [0, 1] given [0, 1] × Ω, ¯ξ (dx × dy × dt |ω ) = ¯ξ (dx × dy |t, ω ) dt, η¯ (dn × dσ × dt |ω ) = η¯ (dn × dσ |t, ω ) dt. 2. 
Define marginal distributions by ¯ξ (A1 |t, ω ) = ¯ξ(A1 × {∆ ∪ [0, ∞)}N |t, ω ), η¯1 (C1 |t, ω ) = η¯ (C1 × Y |t, ω ) , 1 ¯ξ 2 (A2 |t, ω ) = ¯ξ([0, ∞)N ×A2 |t, ω ), η¯2 (C2 |t, ω ) = η¯ ({1, . . . , N } × C2 |t, ω ) . ¯ (dy |x, t, ω ) on {∆ ∪ [0, ∞)}N given [0, ∞)N × Then there exist stochastic kernels Λ ¯ and Θ [0, 1] × Ω ¯ (dσ |n, t, ω ) on Y given {1, . . . , N } × [0, 1] × Ω, ¯ ¯ such that P-a.s. ¯ for ω ∈ Ω Z Z Z ¯ξ (A1 , A2 |t, ω ) dt = ¯ (A2 |x, t, ω ) ¯ξ 1 (dx |t, ω ) dt, (4.12) Λ ZB ZB ZA1 η¯ (C1 , C2 |t, ω ) dt = ¯ (C2 |n, t, ω ) η¯1 (dn |t, ω ) dt Θ (4.13) B B C1 52 for Borel subsets A1 of [0, ∞)N , A2 of {∆ ∪ [0, ∞)}N , C1 of {1, . . . , N }, C2 of Y, and B of [0, 1]. ¯ 3. Statements analogous to Parts 1 and 2 above hold for r and q, and for P-a.s. ¯ for ω ∈ Ω Z Z r (A1 , A2 , B |ω ) = Λ (A2 |x,t) ¯ξ 1 (dx |t, ω ) dt, (4.14) B A1 where Λ (· |x,t ) = ⊗ni=1 γ (· |xi ,t), and with γ defined by (1.9). Also, q (dn, dσ, dt) has the form Z Z q (C1 , C2 , B |ω ) = Θ (C2 |n) η¯1 (dn |t, ω ) dt, (4.15) B C1 where Θ (· |n) = ⊗ni=1 δ i ⊗N j=n+1 Un is the product measure defined in (3.4), and Ai , Ci and B take the same form as in Part 2. ¯ 4. P-a.s. ¯ we have for ω ∈ Ω, Z t Z tZ ¯ξ 1 (A |s, ω ) ds = ¯ (dσ |gnum (y) , s, ω ) ¯ξ 2 (dy |s, ω ) ds δ {h(g(y),σ)} (A) Θ 0 0 Z t Z tZ η¯1 ({n} |s, ω ) ds = 1{gnum (y)=n} ¯ξ 2 (dy |s, ω ) ds. 0 0 Therefore, a.s. with respect to Lebesgue measure Z ¯ξ 1 (A |s, ω ) = ¯ (dσ |gnum (y) , s, ω ) ¯ξ 2 (dy |s, ω ) (4.16) δ {h(g(y),σ)} (A) Θ Z η¯1 ({n} |s, ω ) = 1{gnum (y)=n} ¯ξ 2 (dy |s, ω ) . (4.17) Furthermore, for any subset C of {1, . . . , N } Z 1 ¯ L(C) = η¯1 (C |t, ω )dt. (4.18) 0 53 Remark 4.6. We suppress ω variable in later discussion from time to time, but keep in mind that the measures are random. m Proof. Part 1. 
Since (¯ξ , η¯m ) → (¯ξ, η¯) in distribution, by the Skorokhod Representa- m tion Theorem, we can assume that w.p.1 (¯ξ , η¯m ) ⇒ (¯ξ, η¯) on some probability space   ¯ F, (Ω, ¯ By definition, ¯ξ m ([0, ∞)N × {∆ ∪ [0, ∞)}N × dt) = ¯ξ m (dt) = λ (dt), ¯ P). 3 where λ is Lebesgue measure on [0, 1]. It follows easily from the convergence that   w.p.1 ¯ξ 3 (dt) = λ (dt). Hence by Theorem 0.9, there exists a stochastic kernel ¯ξ (dx × dy |t, ω ) on [0, ∞)N × {∆ ∪ [0, ∞)}N given [0, 1] × Ω ¯ such that w.p.1 ¯ξ (dx × dy × dt |ω ) = ¯ξ (dx × dy |t, ω ) dt. The decomposition of η¯ is proved in the same fashion. Part 2. The stated decompositions follow from another application of Theorem 0.9. Part 3. If (x, t) → γ(·|x, t) were continuous then (x, t) → Λ (· |x,t ) would also be continuous, and the form of r would follow from the fact that by Lemma 3.2 Λm ¯m ⇒ ¯ξ. Now in fact j (· |x) → Λ (· |x,t) uniformly, and the weak convergence ξ (x, t) → γ(·|x, t) is not continuous, though under the given conditions it is continuous in t and x → γ(·|x, t) is always right continuous and left continuous when x is not of the form (0, 1]∩[Z/q] (see Lemmas 1.5 and 3.2). Thus to obtain the decomposition via ¯ ¯ξ m ({[K c ]N }c ) can be made as Lemma 1.6, it is sufficient to show that under (4.11), E 1 δ small as desired by choosing δ > 0 sufficiently small, where Kδ = ∪qk=1 ([k/q]−δ, [k/q]) and Kδc = [0, ∞)\Kδ . We first claim that it is sufficient to show that given ε > 0, X X N m−1 ¯1 ¯m Λ E j,n (Kδ ) < ε (4.19) m n=1 j=0 ¯ m is the nth marginal of Λ for some δ > 0, where Λ ¯ m . Using the reasoning that lead j,n j 54 to (4.10), we have 1 X X ¯ ¯m m−1 N  ¯ ¯ξ m ({[K c ]N }c ) ≤ E P Xj,n ∈ Kδ 1 δ m j=0 n=1 1 X X ¯ ¯m m−1 N  ≤ N P Yj,n ∈ Kδ m j=0 n=1 X X N m−1 ¯1 = NE ¯ m (Kδ ). Λ j,n m n=1 j=0 ¯ 1 Pm−1 Λ We next show that for each n = 1, . . . , N , E ¯ m (Kδ ) can be made small, m j=0 j,n thus (4.19). Let Λm m j,n (·|x) = 1[0,1) (x)αj (·|x) + 1[1,∞) (x)δ x−1 (·. 
Applying once again Lemma 0.7(a) as in (4.9) and Jensen’s inequality, for any bounded measurable func- tion f (Z ! Z !) 1 X m−1 1 X m−1  ¯ E fd ¯m Λ j,n (·) − log ef d Λm j,n · h g(Y ¯ jm ), σ ¯mj m j=0 m j=0 ( ) 1 X m−1 m  ¯ ≤E ¯m R Λ j (·) Λj · h g(Y ¯ jm ), σ ¯mj m j=0 ≤ M. Assume that given ε, there exist δ > 0 so that X m−1 ¯1 Λm ¯m E j,n (Kδ |Xj,n ) < ε. (4.20) m j=0 Substituting into the last display the function f that equals 0 on Kδc and log (1 + 1/ε) 55 on Kδ , we have ! 1 X ¯m m−1 ¯ E Λ (Kδ ) m j=0 j,n " !#! 1 X m m−1 1 m ≤ ¯ log 1 + 1 M +E Λ Kδ X ¯ j,n log (1 + 1/ε) ε m j=0 j,n 1 ≤ (M + log 2) . log (1 + 1/ε) This is made as small as desired by choosing ε > 0 small. We complete by proving the claim (4.20). Recall that by definition Λm j,n (·|x) = 1[0,1) (x)αm m j (·|x) + 1[1,∞) (x)δ x−1 (·), and that αj has a uniformly bounded density with respect to Lebesgue measure everywhere except (0, 1] ∩ [Z/q] (this follows the same proof as for γ(·|x, t)). Thus by the right continuity of x 7→ αm m j (·|x), αj (Kδ |x) is proportional to δ for all x save those in intervals of the form ([k/q], [k/q] + δ). ¯m However, the only way a particular component of X j can be in this set (at least if δ < 1/q) is if one of its summands in the random walk that ended at that component was not in the support of the discrete measure. When this happens, the distribution of this particular component has a bounded density with respect to Lebesgue mea- sure, and hence the probability that it falls in Kδ is proportional to δ. Therefore, ¯ 1 Pm−1 Λm (Kδ |X E ¯ m ) is itself proportional to δ, which completes the argument. m j=0 j,n j,n The argument for (4.15) is similar but much simpler since the measure is on a discrete space and is omitted. Here we use that at the prelimit q m (dn, dσ, dt) = δ N¯dmte m ηm (dn) Θ (dσ |n) dt = [¯ 1 (dn |t) dt] Θ (dσ |n) . Part 4. For t ∈ [0, 1] define bmtc bmtc ¯m 1 X ¯ m 1 X Jt = δ ¯m m , Lt = δ ¯m . 
m j=1 h(g(Yj ),¯σj ) m j=1 gnum (Yj ) 56 Then for bounded and continuous functions u1 mapping [0, ∞)N into R and u2 mapping {1, . . . , N } into R, by definition Z bmtc Z bmtc/m Z 1 X  m u1 (x) J¯tm (dx) = ¯ m u1 Xj = u1 (x) ¯ξ 1 (dx| s) ds, m j=1 0 Z bmtc Z bmtc/m Z ¯ m (dn) = 1 X ¯ m  u2 (n) L t u2 Nj = u2 (n) η¯m 1 (dn| s) ds. m j=1 0 This implies Z Z   t t J¯tm ¯m (dx) , L (dn) → ¯ξ (dx |s) ds, η¯1 (dn |s ) ds . t 1 0 0 For n ∈ N and j ∈ {0, 1, . . . , m − 1}, we define F¯jm to be the σ-field generated by ¯ m, σ {(Y m ¯m ¯m ¯m k ¯ k ), k ≤ j} and Gj to be that generated by Fj−1 and Yj . Then Z  m  u1 ¯ m h g Yj+1 , σ ¯m ¯ j+1 − u1 (h (g (y) , σ)) Θ ¯m j+1 (dσ)Λj (dy), Z  u2 gnum (Y m ¯ m (dy) ¯ j+1 ) − u2 (gnum (y)) Λ j are martingale differences with respect to F¯jm and G¯jm , respectively. Since Z Z bmtc/m Z m u1 (x) J¯tm (dx) − u1 (h (g (y) , σ)) η¯m (dσ |gnum (y) , s) ¯ξ 2 (dy |s) ds 0 bmtc bmtc−1 Z 1 X ¯j − m  1 X ¯ m (dσ)Λ ¯ m (dy) = u1 X u1 (h (g (y) , σ)) Θ j+1 j m j=1 m j=0 bmtc−1  Z  1 X  ¯ (dσ)Λ ¯ j+1 − u1 (h (g (y) , σ)) Θ m m ¯ (dy) . m = u1 X j+1 j m j=0 Denote the left hand side of the above equalities by Rm . Then Chebyshev’s inequality 57 and a standard conditioning argument yield that for any ε > 0 ¯ {|Rm | ≥ ε} P    1 bmtc−1 X   Z   ¯ = P ¯ m ¯ m ¯ m u1 Xj+1 − u1 (h (g (y) , σ)) Θj+1 (dσ)Λj (dy) ≥ ε  m  j=0     bmtc−1  X Z  2 1¯ 1   ≤ 2E u1 X¯m j+1 − ¯m u1 (h (g (y) , σ)) Θ ¯m j+1 (dσ)Λj (dy)  ε m 2 j=0     bmtc−1  Z 2  1¯ 1 X  = 2E u1 X¯m j+1 − ¯m u1 (h (g (y) , σ)) Θ ¯m j+1 (dσ)Λj (dy) ε  m2 j=0  4 kuk2∞ ≤ . mε2 ¯ m , and so (4.16) and (4.17) follow. Lastly A similar calculation can be done for L t ¯m = L (4.18) follows from the fact that L ¯ m. 1 4.3 Proof of the Lower Bound (2.2) In this section, we introduce some related concepts, identify the rate function and give the prove for the first part to Theorem 2.2, i.e. (2.2). Definition 4.7 (Feasible). Probability measures ξ ∈ P([0, ∞)N × {∆ ∪ [0, ∞)}N ) and η ∈ P({1, . 
. . . , N } × Y) are called feasible if the analogues of (4.16) and (4.17) hold, i.e.,
$$\xi_1(A) = \int \delta_{h(g(y),\sigma)}(A)\, \eta(d\sigma|g_{num}(y))\, \xi_2(dy),$$
$$\eta_1(\{n\}) = \int 1_{\{g_{num}(y)=n\}}\, \xi_2(dy), \tag{4.21}$$
where $\xi_i$ and $\eta_i$ are the marginals and $\eta(d\sigma|n)$ is defined by $\eta(dn \times d\sigma) = \eta_1(dn)\,\eta(d\sigma|n)$.

For $\phi \in \mathcal{P}(\{1,\dots,N\})$, define $\mathcal{A}(\phi) = \{\text{feasible } (\xi,\eta) \text{ such that } \eta_1 = \phi\}$. We now give the form of the local rate function $G^N$ in Theorem 2.2:
$$G^N(t,\phi) = \inf_{(\xi,\eta) \in \mathcal{A}(\phi)} \left[ \int R(\xi(\cdot|x)\,\|\,\Lambda(\cdot|x,t))\, \xi_1(dx) + \int R(\eta(\cdot|n)\,\|\,\Theta(\cdot|n))\, \eta_1(dn) \right], \tag{4.22}$$
where $\Lambda(A_1 \times \cdots \times A_N|x,t) = \otimes_{i=1}^N \gamma(A_i|x_i,t)$.

Let $(\bar{\xi}^m, \bar{\eta}^m)$ give a value that is within $\varepsilon > 0$ of the infimum in (4.5). Tightness of $\{(\bar{\xi}^m, \bar{\eta}^m, r^m, q^m)\}$ has been proved along any sequence for which the relative entropy costs are bounded, and for the purposes of proving a lower bound on $\hat{W}^m_N$ this can be assumed without loss. Tightness of $\{\bar{L}^m\}$ follows from the fact that $\bar{L}^m$ takes values in a compact set. We now evaluate the limit inferior of $\hat{W}^m_N$ along a subsequence of $m \in \mathbb{N}$ for which $(\bar{\xi}^m, \bar{\eta}^m, r^m, q^m, \bar{L}^m) \to (\bar{\xi}, \bar{\eta}, r, q, \bar{L})$. Then
$$\begin{aligned}
\liminf_{m\to\infty} \hat{W}^m_N + \varepsilon
&\ge \liminf_{m\to\infty} \bar{E}\left[ \int_0^1 \left( R(\bar{\xi}^m(\cdot|t)\,\|\,r^m(\cdot|t)) + R(\bar{\eta}^m(\cdot|t)\,\|\,q^m(\cdot|t)) \right) dt + \langle F^N, \bar{L}^m \rangle \right] \\
&= \liminf_{m\to\infty} \bar{E}\left[ R(\bar{\xi}^m\,\|\,r^m) + R(\bar{\eta}^m\,\|\,q^m) + \langle F^N, \bar{L}^m \rangle \right] \\
&\ge \bar{E}\left[ R(\bar{\xi}\,\|\,r) + R(\bar{\eta}\,\|\,q) + \langle F^N, \bar{L} \rangle \right] \\
&= \bar{E}\left[ \int_0^1 \left( R(\bar{\xi}(\cdot|t)\,\|\,r(\cdot|t)) + R(\bar{\eta}(\cdot|t)\,\|\,q(\cdot|t)) \right) dt + \langle F^N, \bar{L} \rangle \right] \\
&\ge \bar{E}\left[ \int_0^1 G^N(t, \bar{\eta}_1(\cdot|t))\, dt + \langle F^N, \bar{L} \rangle \right] \\
&\ge \inf\left\{ \int_0^1 G^N(t,\phi(t))\, dt + \langle F^N, \nu \rangle : \phi(\cdot) \in \mathcal{P}(\{1,\dots,N\}),\ \nu = \int_0^1 \phi(s)\, ds \right\}.
\end{aligned}$$
The first line of this display is formula (4.5). The second line follows from the chain rule, Lemma 0.7(f). The third line is a consequence of the convergence in distribution of $(\bar{\xi}^m, \bar{\eta}^m, r^m, q^m, \bar{L}^m) \to (\bar{\xi}, \bar{\eta}, r, q, \bar{L})$, the nonnegativity and lower semicontinuity of relative entropy, Fatou's Lemma, and the non-negativity of $F^N$.
The fourth line is again the chain rule, and the fifth is due to the almost sure feasibility implied by (4.16) and (4.17) and the definition of $G^N$. The final line follows from (4.18).

Note that the proof of the Laplace upper bound naturally provides a candidate for the rate function, which is defined for $\nu \in \mathcal{P}(\{1,\dots,N\})$ by
$$I_N(\nu) = \inf\left\{ \int_0^1 G^N(t,\phi(t))\, dt : \nu = \int_0^1 \phi(s)\, ds \right\},$$
with $G^N(t,\phi)$ defined as in (4.22). Since $\varepsilon > 0$ is arbitrary, we have proved that every subsequence of the original sequence $\{\hat{W}^m_N, m \in \mathbb{N}\}$ has a subsubsequence satisfying
$$\liminf_{m\to\infty} \hat{W}^m_N \ge \inf_{\nu \in \mathcal{P}(\{1,\dots,N\})} \left[ I_N(\nu) + \langle F^N, \nu \rangle \right].$$
An argument by contradiction applied to an arbitrary subsequence shows that the entire sequence $\{\hat{W}^m_N, m \in \mathbb{N}\}$ satisfies this same lower limit.

Chapter 5

The Large Deviation Lower Bound

In this chapter we prove (2.3), which corresponds to a large deviation lower bound. To prove this bound, one starts with a near minimizer of the right-hand side and, based on it, constructs a control that can be applied in the prelimit stochastic control representation. A key qualitative property used to prove the convergence of the costs is a form of ergodicity. As we will see, the nonstandard features of the problem that made the proof of the large deviation upper bound difficult complicate the analysis here as well. In particular, each of the following properties requires the introduction of new techniques and methods of analysis: the transition kernel on thresholds is in general not Feller; the transition kernels depend both on the large deviation parameter $m$ and on the threshold index; and the underlying process includes the absorbing state $\Delta$. We begin by discussing the ergodicity properties of the limit transition kernels.
5.1 Some Important Assumptions and Ergodicity of the Limit Underlying Transition Kernel

To establish the large deviation lower bound, we need to introduce additional conditions on the underlying measures $\mu(dy|t)$ and the limit transition kernel $\gamma$ defined in (1.9). Note that since $\gamma$ contains a cemetery state $\Delta$, the relevant ergodicity properties are with respect to the kernel that excludes the absorbing set. Note also that under Conditions 1.1, 1.3 and 1.4, $\gamma(\Delta|x,t)$ is bounded away from 0 and 1, uniformly in $t \in [0,1]$ and $x \in [0,1)$. Define a stochastic kernel $\tilde{\gamma}$ on $\mathcal{P}([0,\infty))$ given $[0,\infty) \times [0,1]$ by
$$\tilde{\gamma}(A|x,t) = \frac{\gamma(A|x,t)}{1 - \gamma(\Delta|x,t)}$$
for Borel sets $A \subset [0,\infty)$.

Some type of transitivity or communication condition of the following sort is standard in the large deviation theory for the empirical measures of a Markov chain. Note that the assumption is made for each $t$ due to the nonstationarity of the process, but under Condition 1.4 the condition holds for one $t$ if and only if it holds for all $t \in [0,1]$. Also, we can restrict to initial conditions $x \in [0,1)$, since for initial conditions $x \ge 1$ the dynamics of the chain simply move the process to $x-1$ with probability one.

Condition 5.1 (Transitivity). For each $t \in [0,1]$, there exist positive integers $l_0$ and $n_0$ such that for all $x, \zeta \in [0,1)$
$$\sum_{i=l_0}^{\infty} \frac{1}{2^i}\, \tilde{\gamma}^{(i)}(dy|x,t) \ll \sum_{j=n_0}^{\infty} \frac{1}{2^j}\, \tilde{\gamma}^{(j)}(dy|\zeta,t).$$
Here, $\tilde{\gamma}^{(k)}$ is the $k$-step transition probability defined recursively by
$$\tilde{\gamma}^{(k+1)}(A|x,t) = \int \tilde{\gamma}(A|y,t)\, \tilde{\gamma}^{(k)}(dy|x,t)$$
for Borel sets $A$, with $\tilde{\gamma}^{(1)} = \tilde{\gamma}$.

An immediate consequence of Condition 5.1 is the indecomposability [1, Section 7.3] of $\tilde{\gamma}$, which is shown in the next proposition. The idea of the proof will be used again in the next section when we show that the IPS transition kernel, involving $N$ particles, also satisfies a "transitivity" condition.

Proposition 5.2. Assume Condition 5.1.
Then for each $t \in [0,1]$, $\tilde{\gamma}$ is indecomposable; i.e., there do not exist disjoint Borel subsets $A_1$ and $A_2$ of $[0,\infty)$ such that
$$\tilde{\gamma}(A_1|x,t) = 1 \text{ for all } x \in A_1 \quad \text{and} \quad \tilde{\gamma}(A_2|\zeta,t) = 1 \text{ for all } \zeta \in A_2. \tag{5.1}$$

Proof. We suppress $t$ in the notation since it plays no role in the proof. Assume (5.1) holds. Then because the dynamics shift $x$ to $x-1$ when $x \ge 1$, there exist $x \in [0,1) \cap A_1$ and $\zeta \in [0,1) \cap A_2$ such that $\tilde{\gamma}(A_1|x) = 1$ and $\tilde{\gamma}(A_2|\zeta) = 1$. It then follows from (5.1) that for any $l$ and $n$ in $\mathbb{N}$ we have $\tilde{\gamma}^{(l)}(A_1|x) = 1$ and $\tilde{\gamma}^{(n)}(A_2|\zeta) = 1$. According to Condition 5.1, the first of these equalities with $l = l_0$ guarantees that there exists $j \ge n_0$ such that $\tilde{\gamma}^{(j)}(A_1|\zeta) > 0$. Since $A_1$ and $A_2$ are disjoint, for $\zeta \in A_2$ this inequality is incompatible with the equality $\tilde{\gamma}^{(n)}(A_2|\zeta) = 1$ for any $n \in \mathbb{N}$. Hence the indecomposability of $\tilde{\gamma}(\cdot|x)$ follows.

In the proof of the lower bound, ergodicity will be used to derive various limits. However, owing to the nonstationarity there will be contributions from several different Markov processes, with the final state of one serving as the initial state of the next. For this reason, we need the following type of condition, which ensures that the bounds coming from each process are independent of the starting state. See the discussion in [9, Section 8.4].

Condition 5.3 (Uniformity). For any $t \in [0,1]$ and $x \in [0,1)$ there exist a probability measure $\beta$ on $[0,\infty)$, $k \in \mathbb{N}$, and $c > 0$ such that for all $\zeta \in [0,1)$ satisfying $|x - \zeta| < c$ and for all Borel sets $A$
$$\sum_{i=1}^{k} \tilde{\gamma}^{(i)}(A|\zeta,t) \ge c\,\beta(A).$$

5.2 The Construction of Ergodic IPS Transition Kernels

As mentioned in the previous section, the underlying transition kernel $\gamma$ contains a cemetery state. In order to introduce additional ergodicity properties, we focus our discussion on $\tilde{\gamma}$, the version of $\gamma$ conditioned to avoid the cemetery state $\Delta$.
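For a finite-state caricature (the transition matrix below is invented purely for illustration), conditioning out the cemetery state is just a rowwise renormalization, exactly as in the definition of $\tilde{\gamma}$:

```python
import numpy as np

# An invented finite-state caricature of gamma: three "live" states plus a
# cemetery column Delta (last entry).  Rows sum to one.
gamma = np.array([
    [0.30, 0.20, 0.10, 0.40],   # gamma(. | state 0); 0.40 = gamma(Delta | 0)
    [0.10, 0.40, 0.20, 0.30],
    [0.05, 0.15, 0.60, 0.20],
])

def condition_out_cemetery(kernel):
    """gamma~(.|x) = gamma(.|x) / (1 - gamma(Delta|x)), rowwise."""
    survive = kernel[:, :-1]
    return survive / survive.sum(axis=1, keepdims=True)

gamma_tilde = condition_out_cemetery(gamma)
assert np.allclose(gamma_tilde.sum(axis=1), 1.0)   # again a stochastic kernel
print(gamma_tilde[0])
```

The renormalization is well defined precisely because $\gamma(\Delta|x,t)$ is bounded away from 1, so each row retains positive surviving mass.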
In this section, we move on to a family of IPS-type processes which share the same transportation measure, related to $\gamma$, but have different resampling mechanisms. Again, such processes contain a cemetery state $\{\Delta\}^N$. However, under the assumptions on $\tilde{\gamma}$ from the previous section, the renormalized versions of these IPS-type processes also satisfy transitivity and uniformity conditions. Furthermore, such processes are geometrically ergodic.

Let $T_1 = [0,\infty)^N$ and $T_2 = \{\Delta \cup [0,\infty)\}^N$. Recall that for any admissible control with finite cost, the transportation control must be supported on $T_2 \setminus \{\Delta\}^N$; i.e., with probability one at least one particle makes it to the next threshold. Hence it is relevant to study the renormalized transportation measure on $T_2 \setminus \{\Delta\}^N$ defined by
$$\tilde{\Lambda}(\cdot|x,t) = \frac{\Lambda(\cdot|x,t)}{1 - \Lambda(\{\Delta\}^N|x,t)} = \frac{\otimes_{i=1}^N \gamma(\cdot|x_i,t)}{1 - \otimes_{i=1}^N \gamma(\Delta|x_i,t)}. \tag{5.2}$$
Denote $c(x,t) = 1 - \otimes_{i=1}^N \gamma(\Delta|x_i,t)$. Since $\gamma(\Delta|x,t)$ is bounded away from 0 and 1 uniformly in $t \in [0,1]$ and $x \in [0,1)$, $c(x,t)$ is bounded away from 0 and 1 uniformly in $t \in [0,1]$ and $x \in T_1 \setminus [1,\infty)^N$. We now consider $\tilde{\Lambda}$ together with any resampling measure $\tilde{\Theta}$ that satisfies $\tilde{\Theta} \ll \Theta$, and denote the associated transition kernel on $T_1$ given $T_1$ by $\tilde{p}(x,\cdot|t)$, i.e.,
$$\tilde{p}(x,\cdot|t) = \int_{T_2 \times \mathcal{Y}} \delta_{h(g(y),\sigma)}(\cdot)\, \tilde{\Theta}(d\sigma|g_{num}(y))\, \tilde{\Lambda}(dy|x,t). \tag{5.3}$$
Note the ambiguity of the notation: $\tilde{p}$ can be associated with a number of resampling measures, but it always uses the transportation measure $\tilde{\Lambda}$.

Proposition 5.4. Consider any resampling measure $\tilde{\Theta}$ such that $\tilde{\Theta} \ll \Theta$, and assume Conditions 1.1, 1.2 and 1.4. Then for each $t \in [0,1]$, $\tilde{p}(x,\cdot|t)$ has an invariant measure.

Proof. The main difficulty is due to the fact that the process does not satisfy the Feller property. For $m \in \mathbb{N}$, $j = 0,\dots,m-1$, and $\tilde{X}^m_0 = 0$, define random variables $\tilde{X}^m_0, \tilde{Y}^m_1, \tilde{X}^m_1, \tilde{Y}^m_2, \dots$
recursively via the requirements n o   ˜ m ˜m ˜ ˜ m ˜ m P Yj+1 ∈ · Xj = x = Λ (· |x,t) , Nj+1 = gnum Yj+1 n o   ˜m ˜ ˜ ˜ P σm j+1 ∈ · Nj+1 = n = m m m Θ (· |n, t) , Xj+1 = h g(Yj+1 ), σ j+1 . Let 1 X m J˜m = δ ˜m, m j=1 Xj and denote X m ˜ξ m (dx,dy) = 1 ˜ (dy |x,t) . δ ˜ m (dx) × Λ m j=1 Xj Then for any bounded and continuous function u : T1 → R, Z Z 1 X  ˜ m m m ˜m u (x) J (dx) = u Xj = u1 (x) ˜ξ 1 (dx) . m j=1 ˜ (·|x, t) kΛ(· |x,t)) = − log c (x, t) < ∞ for all x and Note that by construction R(Λ ˜  Θ implies R(Θ(·|n, t, and Θ ˜ ˜ and Θ are probability t) kΘ(· |n)) < ∞ since Θ m m measures on a finite set. Thus by Proposition 4.4 {˜ξ } is tight. Let ˜ξ converge in distribution to a limit ˜ξ along a subsequence. Along that same subsequence obviously J˜m (dx) → ˜ξ 1 (dx) . We can then use the martingale difference argument and Condition 1.4 to deal with 65 the places where the Feller property fails just as in Part 4 of Lemma 4.5, and obtain Z Z Z u (x) ˜ξ 1 (dx) = ˜ (dy |x,t) ˜ξ 1 (dx) ˜ (dσ |gnum (y)) Λ u(h (g (y) , σ))Θ w.p.1. We now use the facts that there is a countable set of measure determining functions u to conclude that w.p.1 Z ˜ξ 1 (A) = p˜ (x, A |t ) ˜ξ 1 (dx) for all Borel sets A, i.e., ˜ξ 1 is an invariant measure of p˜ (x, · |t). For t ∈ [0, 1], let π ˜ (dx |t) denote any of the invariant distributions of p˜(x, · |t ) just constructed. The next lemma shows that there is in fact only one such invariant probability and that the associated Markov process is geometrically ergodic, as well as other properties that will be needed later. ˜ such that Θ Lemma 5.5. Consider any resampling measure Θ ˜  Θ, and assume Conditions 1.1, 1.2, 1.4, 5.1 and 5.3. The following conclusion hold for each t ∈ [0, 1]. 1. The Markov process associated with the transition kernel p˜(x, · |t) and initial ˜ (· |t ) is ergodic. distribution π 2. 
The transition kernel $\tilde{p}$ satisfies a transitivity condition of the same form as $\tilde{\gamma}$: there exist positive integers $n_0$ and $l_0$ such that
$$\sum_{i=l_0}^{\infty} \frac{1}{2^i}\, \tilde{p}^{(i)}(x, dx'|t) \ll \sum_{j=n_0}^{\infty} \frac{1}{2^j}\, \tilde{p}^{(j)}(\zeta, d\zeta'|t)$$
for $x, \zeta \in [0,1)^N$.

3. The transition kernel $\tilde{p}(x,\cdot|t)$ satisfies the following uniformity condition. Given $x \in [0,1)^N$ there exist a probability measure $\tilde{\beta}$ on $[0,\infty)^N$, $\tilde{k} \in \mathbb{N}$ and $\tilde{c} > 0$ such that for all $\zeta \in [0,1)^N$ satisfying $\|\zeta - x\| \le \tilde{c}$ and for all Borel sets $A$,
$$\sum_{i=1}^{\tilde{k}} \tilde{p}^{(i)}(\zeta, A|t) \ge \tilde{c}\,\tilde{\beta}(A).$$
In particular, the precompact set $[0,1)^N$ is a small set [21, (5.14)] for the Markov process associated with the transition kernel $\tilde{p}$.

4. The Markov process associated with the transition kernel $\tilde{p}$ is geometrically ergodic.

5. Let $A$ be a Borel set of $T_1$ such that $\tilde{p}^{(l_0)}(x_0, A|t) > 0$ for some $x_0 \in T_1$ and $l_0 \in \mathbb{N}$. Then $\tilde{\pi}(A|t) > 0$.

Proof. To simplify the notation we omit $t$.

Part 1. We first show that $\tilde{p}(x,\cdot)$ is indecomposable. Ergodicity then follows from [9, Theorem A.4.5]. The proof is by contradiction. Suppose that there exist disjoint sets $A_i \subset T_1$ such that $\tilde{p}(x_i, A_i) = 1$ for all $x_i \in A_i$, $i = 1,2$. Define
$$B_i = \bigcup_{x_i \in A_i} \bigcup_{n=1}^{N} \{x_n : x_i = (x_1, \dots, x_N)\},$$
i.e., $B_i$ is the collection of all possible individual particle locations for joint $N$-particle locations $x_i$ in $A_i$. Note that $A_i \subset \{B_i\}^N$, and therefore $\tilde{p}(x_i, \{B_i\}^N) = 1$ for all $x_i \in A_i$. Since $\gamma(\Delta|x_i) < 1$ for all $x_i \in B_i$, we must have $\gamma(B_i|x_i) > 0$ for all $x_i \in B_i$.

First, we claim that $\tilde{\gamma}(B_i|x_i) = 1$ for all $x_i \in B_i$, $i = 1,2$. If not, then without loss of generality we may assume $i = 1$ and that there exist $z \in B_1$ and $C_1 \subset [0,\infty)$ with $C_1 \cap B_1 = \emptyset$ such that $\gamma(C_1|z) > 0$. Let $\bar{x} \in A_1$ have $z$ as its $n$th component, and consider the set
$$\mathbf{C}_1 = \{B_1\}^{n-1} \otimes C_1 \otimes \{B_1\}^{N-n} \subset T_1.$$
By construction $\mathbf{C}_1 \cap A_1 = \emptyset$, but
$$\tilde{p}(\bar{x}, \mathbf{C}_1) \ge \int_{\mathbf{C}_1 \times \mathcal{Y}} \delta_{h(g(y),\sigma)}(\mathbf{C}_1)\, \tilde{\Theta}(d\sigma|g_{num}(y))\, \tilde{\Lambda}(dy|\bar{x}) \ge \gamma(C_1|z) \prod_{j=1, j \ne n}^{N} \gamma(B_1|\bar{x}_j) > 0,$$
which is a contradiction. Hence $\tilde{\gamma}(B_i|x) = 1$ for all $x \in B_i$.
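As an aside, the kernel $\tilde{p}$ of (5.3) is straightforward to simulate, which may help fix ideas. The sketch below is a toy stand-in (Gaussian transport with an independent killing probability replaces $\gamma$, and uniform resampling of dead slots plays the role of the resampling measure); the rejection loop implements the conditioning on $T_2 \setminus \{\Delta\}^N$, i.e., the $1/c(x,t)$ renormalization in (5.2):

```python
import numpy as np

rng = np.random.default_rng(3)

def transport(x, kill_prob, step_mean=-0.3, step_sd=1.0):
    """Toy stand-in for the transportation measure: each particle takes a
    Gaussian step and, independently, is killed (sent to Delta, encoded
    here as np.nan) with probability kill_prob."""
    y = x + rng.normal(step_mean, step_sd, size=x.shape)
    y[rng.random(x.shape) < kill_prob] = np.nan
    return y

def ips_step(x, kill_prob=0.3):
    """One step of a p~-type kernel: transport conditioned on at least one
    survivor (the rejection loop realizes the renormalization in (5.2)),
    then dead slots are refilled by uniform resampling from survivors."""
    while True:
        y = transport(x, kill_prob)
        alive = ~np.isnan(y)
        if alive.any():                    # condition on avoiding {Delta}^N
            break
    y[~alive] = rng.choice(y[alive], size=int((~alive).sum()))
    return y

x = np.zeros(5)
for _ in range(10):
    x = ips_step(x)
print(np.isnan(x).any())   # after resampling no particle remains at Delta
```

After each step every slot holds a surviving location, reflecting the fact that $\tilde{p}$ is a genuine stochastic kernel on $T_1$.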
With this property established, if $B_1$ and $B_2$ were disjoint then we would have a contradiction to Proposition 5.2. We claim even more: $[0,1) \cap B_i$, $i = 1,2$, are not disjoint. Indeed, if $x \in B_1 \cap B_2$, then it is automatic that $x - \lfloor x \rfloor \in [0,1) \cap B_1 \cap B_2$. Thus we may suppose that there exists $x_0 \in B_1 \cap B_2 \cap [0,1)$ which, without loss of generality, is the first component of some $x_i \in A_i$ for $i = 1,2$. Let
$$k = \max_{i,n} \left\{ \lceil x_{i,n} \rceil : x_i = (x_{i,1},\dots,x_{i,N}) \right\}. \tag{5.4}$$
We now construct an event $E$ which satisfies $\tilde{p}^{(k)}(x_i, E) > 0$ for $i = 1,2$. An event on path space will be constructed, which will imply the existence of an event of the desired form. We start at time 0 with particles in locations $x_i$, $i = 1,2$. At each time $\ell = 0,\dots,k-1$, we require that any particle that is a descendant of the particle originally in the first location be returned to $[0,\infty)$, i.e., not sent to $\Delta$. This can always be done with uniformly positive probability. All other particles are dealt with as follows:

• If a particle is in $[0,1)$ but is not a descendant, then it moves to the state $\Delta$.

• If a particle is at $x \in [1,\infty)$, then it is moved to $x-1$.

Again both events can be realized with positive probability. The definition of $k$ in (5.4) guarantees that at the end of $k$ steps, all descendants other than those originating from $x_0$ will have been sent to the cemetery state $\Delta$. In other words, we have identified an event with strictly positive probability defined in terms of the descendants of the particle originally at $x_0$, all of which have exactly the same distribution under the two different initial conditions, and hence such a set $E$ exists.

Part 2. We now prove that $\tilde{p}(x,dy)$ also satisfies a transitivity condition with the same $l_0$ as in Condition 5.1. That is, we need to show that there exists a fixed $\bar{n}_0 < \infty$ such that, given any $x, \zeta \in [0,1)^N$ and $A \subset T_1$ with $\tilde{p}^{(i)}(x,A) > 0$ for some $i \ge l_0$, there is $j \ge \bar{n}_0$ such that $\tilde{p}^{(j)}(\zeta,A) > 0$.
Without loss of generality, we can assume $A \subset T_1$ is a rectangle, i.e., $A = A_1 \times \cdots \times A_N$. In this case, for each $A_n$ there exists $y_n \in \{x_1, \ldots, x_N\}$ such that $\tilde\gamma^{(i)}(A_n \,|\, y_n) > 0$ for the same $i \ge l_0$. Condition 5.1 implies that for $\zeta_n \in [0,1)$ there exists $j_n \ge n_0$ such that $\tilde\gamma^{(j_n)}(A_n \,|\, \zeta_n) > 0$. The problem now becomes finding a common $j$ for all $\zeta_n$.

Using the same occupation measure argument as in Proposition 5.4, one can construct an invariant distribution for $\tilde\gamma$. Let $\tilde\kappa$ denote this measure. We claim that if $\tilde\gamma^{(i)}(A \,|\, x) > 0$ for some $i \ge l_0$, $x \in [0,\infty)$ and Borel set $A$, then $\tilde\kappa(A) > 0$. This is due to the following standard argument. Define the function $\iota$ mapping $[0,1)$ into $\mathbb{N}$ by
\[
\iota(\zeta) = \min\left\{j \in \mathbb{N} : j \ge n_0 \text{ and } \tilde\gamma^{(j)}(A \,|\, \zeta) > 0\right\}.
\]
Condition 5.1 guarantees that $\iota(\zeta)$ is finite for all $\zeta$. Furthermore, $\iota(\zeta)$ is measurable. $[0,1)$ can be written as a disjoint union of Borel sets $\Sigma^{(j)}$, where $\Sigma^{(j)} = \{\zeta : \iota(\zeta) = j\}$. It must be true that $\tilde\kappa(\Sigma^{(j)}) > 0$ for some $j \in \mathbb{N}$ satisfying $j \ge n_0$. The definition of $\Sigma^{(j)}$ implies that $\tilde\gamma^{(j)}(A \,|\, \zeta) > 0$ for all $\zeta \in \Sigma^{(j)}$. Hence,
\[
\tilde\kappa(A) = \int \tilde\gamma^{(j)}(A \,|\, \zeta)\, \tilde\kappa(d\zeta) \ge \int_{\Sigma^{(j)}} \tilde\gamma^{(j)}(A \,|\, \zeta)\, \tilde\kappa(d\zeta) > 0.
\]
This proves the claim.

Continuing with the same set $A$, we note that since $\tilde\gamma$ possesses an invariant probability measure and satisfies the transitivity condition, the corresponding Markov chain is ergodic. It follows from the Ergodic Theorem that
\[
\int_A \frac{1}{n} \sum_{i=1}^{n} \tilde\gamma^{(i)}(A \,|\, \zeta)\, \tilde\kappa(d\zeta) \to \tilde\kappa(A) > 0,
\]
and hence there is $k \in \mathbb{N}$ such that
\[
\int_A \tilde\gamma^{(k)}(A \,|\, \zeta)\, \tilde\kappa(d\zeta) > 0. \tag{5.5}
\]
We now return to the proof of the transitivity. According to the previous paragraph, for each $n$ there is $k_n \in \mathbb{N}$ such that (5.5) holds with $k$ and $A$ replaced by $k_n$ and $A_n$. Also, we can find $\{c_1, \ldots, c_N\} \in \mathbb{N}^N$ satisfying
\[
j_1 + c_1 k_1 = \cdots = j_N + c_N k_N \ge n_0.
\]
It then follows that with $\bar n_0$ equal to the common value $j_n + c_n k_n$,
\[
\tilde p^{(\bar n_0)}(\zeta, A) \ge \prod_{n=1}^{N} \int_{A_n} \cdots \int_{A_n} \tilde\gamma^{(j_n)}(dz_0 \,|\, \zeta_n)\, \tilde\gamma^{(k_n)}(dz_1 \,|\, z_0) \cdots \tilde\gamma^{(k_n)}\bigl(dz_{c_n} \,\big|\, z_{c_n - 1}\bigr) > 0.
\]

Part 3. Let $d = \sup_{x \in [0,1),\, t \in [0,1]} \gamma(\Delta \,|\, x, t) < 1$. Without loss of generality, assume that $A$ is a rectangle, i.e., $A = A_1 \times \cdots \times A_N$. Then
\[
\tilde p^{(i)}(\zeta, A) \ge (1 - d)^{iN} \prod_{n=1}^{N} \tilde\gamma^{(i)}(A_n \,|\, \zeta_n).
\]
Let $\tilde k = k$, $\tilde c \in (0, c(1-d)^{kN})$ and $\tilde\beta$ be the product of $\beta$. The uniformity property then follows for small enough but positive $\tilde c$ by Condition 5.3.

Part 4. It is easy to check that $\tilde p$ is $\psi$-irreducible [21, Proposition 4.2.2] by the uniformity property proved in Part 3. Furthermore, $[0,1)^N$ is a small set of the process. Geometric ergodicity then follows from [21, Theorem 15.0.1].

Part 5. The proof is essentially the same as the argument used in Part 2 to show that $\tilde\gamma^{(i)}(A \,|\, x) > 0$ implies $\tilde\kappa(A) > 0$. See also [9, Lemma 8.6.2].

5.3 The Construction of Ergodic Controls

In this section we show how to construct, from the renormalized underlying transition kernel $\tilde p$, controls that inherit the ergodicity property (Lemma 5.6), and how to reconstruct a (state-dependent) random walk from a given transition kernel of a single particle (Lemma 5.7).

The following lemma tells us that given any target distribution on the number of successful particles, we can find a control with cost close to the optimum and for which the associated Markov process is geometrically ergodic and the transportation cost is bounded as a function of $x$. Let
\[
\tilde\phi(\{n\} \,|\, t) = \int 1_{\{g_{num}(y) = n\}}\, \tilde\Lambda(dy \,|\, x, t)\, \tilde\pi(dx \,|\, t), \tag{5.6}
\]
the distribution of the number of successful particles associated with the renormalized measures of the underlying process model and a resampling distribution $\tilde\Theta \ll \Theta$, i.e., $(\tilde\Lambda, \tilde\Theta, \tilde\pi)$. In the lemma $\|\cdot\|_v$ denotes the total variation norm. Again we suppress the $t$ variable in the argument.

Lemma 5.6.
Assume Conditions 1.1, 1.3, 1.4, 5.1 and 5.3 and consider any ϕ ∈ P {1, · · · , N } satisfying GN (t, ϕ) < ∞, where GN is the local rate function defined in (4.22). 1. For (ξ, η) ∈ A (ϕ) suppose that R (ξ kξ 1 ⊗ Λ) + R (η kϕ ⊗ Θ) < ∞, and recall that by the definition of feasibility ξ 1 = [ξ]1 is an invariant distribu- ˜ define Θ tion of the associated transition kernel. Let η = ϕ ⊗ Θ ˜ and let π ˜ be ˜ Θ). invariant for (Λ, ˜ Then ξ 1  π ˜. ¯ ∈ P{1, . . . , N } such that kφ 2. Given δ > 0 there exists φ ¯ − ϕkv ≤ δ, and fur- thermore there exists (¯ξ, η¯) ∈ A(φ) ¯ such that the associated controlled Markov ¯ j is geometrically ergodic, and sequence X    ¯ ≤ R ¯ξ ¯ξ ⊗ Λ + R η¯ ¯ GN t, φ φ ⊗ Θ ≤ GN (t, ϕ) + δ. (5.7) 1 k 3. There exists a sequence of admissible controls (¯ξ , η¯) such that the associ- ated transition kernels are geometrically ergodic and the transportation cost ¯ k (·|x)kΛ(·|x)) is bounded as a function of x. Furthermore, R(Λ  ¯ k (· |x) kΛ (· |x)) → R Λ R(Λ ¯ (· |x) kΛ (· |x) (5.8) k k k and k¯ξ 1 − ¯ξ 1 kv → 0, where ¯ξ = ¯ξ 1 ⊗ Λ ¯ k , ¯ξ = ¯ξ 1 ⊗ Λ ¯ We place the proof for Part 3 of the Lemma in the appendix at the end of this chapter, since it requires a proposition that will be stated later in the next section. ¯ by the decompositions ξ = ξ 1 ⊗ Λ Proof. Part 1. Define Λ ¯ and Θ ˜ defined as in the ¯ Θ) statement. Let p¯(x, ·) and p˜(x, ·) be the transition kernels associated with (Λ, ˜ 72 ˜ Θ) and (Λ, ˜ according to the analogues of (5.3). 
Then applying Part 4 of Lemma 4.5 and Lemma 0.7(g) to get the first inequality Z p (x, ·) k˜ R (¯ p (x, ·)) ξ 1 (dx) T1 Z Z = ˜ δ h(g(y),σ) (dx0 ) Θ(dσ ¯ |gnum (y))Λ(dy |x) T1 T1 ×Y×{T2 \{∆N }} R # d δ h(g(y),σ) (·) ˜ Θ(dσ |gnum (y)) ¯ Λ(dy |x) Y×{T2 \{∆N }} × log R (x0 ) ξ 1 (dx) ˜ ˜ d Y×{T2 \{∆N }} δ h(g(y),σ) (·) Θ(dσ |gnum (y))Λ(dy |x) Z Z ¯ |x)  ¯ dΛ(· ≤ Λ(dy |x) log (y) ξ 1 (dx) T1 T2 \{∆N } ˜ |x) dΛ(· Z Z ¯ |x)  ¯ d Λ(· = Λ(dy |x) log (y) ξ 1 (dx) T1 T2 \{∆N } dΛ(· |x) Z Z  ¯ dΛ(· |x) + Λ(dy |x) log (y) ξ 1 (dx) T1 T2 \{∆N } ˜ |x) dΛ(· Z Z  = ¯ R Λ(· |x) kΛ(· |x) ξ 1 (dx) + log c (x) ξ 1 (dx) T1 T1 < ∞. Hence the set C = {x ∈ T1 : p¯ (x, ·)  p˜ (x, ·)} satisfies ξ 1 (C) = 1. If we redefine p¯ (x, ·) to equal p˜ (x, ·) for x ∈ Cc , then ξ 1 is also an invariant measure of the new transition kernel. Let p¯ also denote this redefined kernel. One can show by induction that p¯(l0 ) (x, ·)  p˜(l0 ) (x, ·) for all x ∈ C, where l0 is the number appearing in Lemma 5.5. Therefore for any Borel set A with ξ 1 (A) > 0, iterating ξ 1 p¯ = ξ 1 yields Z p¯(l0 ) (x, A) ξ 1 (dx) = ξ 1 (A) > 0. Hence there exists a Borel set B such that ξ 1 (B) > 0 and p¯(l0 ) (x, A) > 0 for all 73 x ∈ B. It follows that p˜(l0 ) (x0 , A) > 0 for some x0 ∈ B ∩ C. By part 5 of the ˜ (A) > 0, and therefore ξ 1  π previous lemma π ˜. ˜ by (5.6). For a ∈ (0, 1), the probability measure ˜ define φ Part 2. For the given Θ ˜ satisfies kϕa − ϕk = akϕ − φk ϕa = (1 − a)ϕ + aφ ˜ v ≤ 2a. Hence to satisfy the first v ¯ = ϕa for any a ∈ (0, δ/2]. Next define probability measures for claim we can take φ a ∈ (0, δ/2] by ¯ξ = (1 − a) ξ + a˜ π ⊗ Λ, ˜ ⊗ Θ. ˜ and η¯ = (1 − a) η + aφ ˜ (5.9) Then ¯ξ 1 = (1 − a) ξ 1 + a˜ ¯ ⊗ Θ. π and η¯ = φ ˜ We first claim that (¯ξ, η¯) ∈ A(φ), ¯ i.e., (¯ξ, η¯) is feasible. 
Indeed, Z δ h(g(y),σ) (A) Θ ˜ (dσ |gnum (y)) ¯ξ 2 (dy) Z n h io = ˜ (dσ |gnum (y)) (1 − a) ξ 2 + a π δ h(g(y),σ) (A) Θ ˜ ˜⊗Λ 2 = [(1 − a) ξ 1 + a˜ π ] (A) = ¯ξ 1 (A) and Z Z n h io 1{gnum (y)=n} ¯ξ 2 (dy) = 1{gnum (y)=n} (1 − a) ξ 2 + a π ˜ ˜⊗Λ 2 h i = (1 − a) ϕ + aφ ˜ ({n}) ¯ ({n}) . = φ Let q and p¯ be the transition kernels defined as in (5.3) associated with the transportation and the resampling controls induced by (ξ, η) and (¯ξ, η¯) (note that p¯ is based on ¯ξ defined in (5.9) and is not the same as in Part 1). Then ξ 1 and ¯ξ 1 are 74 invariant measures of q and p¯, respectively. Furthermore, ¯ξ ⊗ p¯ = (1 − a) ξ ⊗ q + a˜ π ⊗ p˜. (5.10) 1 1 We next verify (5.7). By the convexity of relative entropy    ¯ ≤ R ¯ξ ¯ξ ⊗ Λ + R η¯ ¯ GN t, φ φ ⊗ Θ 1   = R (1 − a) ξ + a˜ π⊗Λ ˜ k(1 − a) ξ 1 ⊗ Λ + a˜ π⊗Λ   ˜⊗Θ + R (1 − a) η + aφ ˜ (1 − a)ϕ ⊗ Θ + aφ ˜⊗Θ ≤ (1 − a) [R (ξ kξ 1 ⊗ Λ) + R (η kϕ ⊗ Θ)] h    i +a R π ˜⊗Λ ˜ k˜π⊗Λ +R φ ˜⊗Θ˜ ˜ φ ⊗ Θ . ˜  Θ, and finally the Using the facts that Θ is a measure on a finite space, that Θ ˜ and Λ, we have that explicit relation between Λ     ˜ ˜ ˜ ˜ ˜ ⊗ Λ k˜ R π π ⊗ Λ + R φ ⊗ Θ φ ⊗ Θ < ∞. Therefore for sufficiently small a > 0,  ¯ ≤ GN t, φ inf [R (ξ kξ 1 ⊗ Λ) + R (η kϕ ⊗ Θ)] + δ = GN (t, ϕ) + δ. (ξ,η)∈A(ϕ) We are now left to show the geometric ergodicity of the associated controlled ¯ j . In fact, if we can prove that p¯ and p˜ are mutually absolute continuous sequence X π ¯ j . Since ˜ -a.s., then repeating the proof as in Lemma 5.5 yields the ergodicity of X ¯ξ 1  π ˜ and ¯ξ 1 (A) ≥ a˜ π (A) for all Borel sets A of T1 , the Radon-Nikodym derivative f (x) = (d¯ξ 1 /d˜ π )(x) satisfies f (x) ∈ [a, ∞) π ˜ -a.s. for x ∈ T1 . For any Borel sets A and B, by (5.10) Z Z ˜ (dx) ≥ a˜ p¯ (x, B) f (x) π π ⊗ p˜ (A × B) = a p˜ (x, B) π ˜ (dx) . A A 75 ˜ -a.s. for all x ∈ T1 This implies that π a p¯ (x, B) ≥ p˜ (x, B) . (5.11) f (x) R On the other hand we know that R(¯ p(x, ·))¯ξ 1 (dx) < ∞, which yields p(x, ·)k˜ p¯ (x, ·)  p˜ (x, ·) , ¯ξ 1 -a.s. 
(5.12) Since ¯ξ 1 and π ˜ are mutually absolute continuous, there exists a Borel set C such that ¯ξ 1 (C) = π ˜ (C) = 0 and both (5.11) and (5.12) hold on the complement of C. Then without loss of generality, we can redefine p¯(x, ·) to equal p˜(x, ·) for x ∈ C. Therefore, p¯ and p˜ are mutually absolutely continuous π ˜ -a.s. Part 3. See Section 5.5. As noted previously, among the features that distinguish this problem from the standard theory for occupation measures is the fact that the underlying transition kernels depend on m. This is a substantial difference, since in the representation at the prelimit it is these kernels that appear in the relative entropy cost, and so a control that is designed on the limit transition kernel may not have even a finite cost for the prelimit kernel. In other words, one cannot in any straightforward way apply a control that is optimal or nearly optimal for the limit to the prelimit, and instead it must be carefully adapted so that the costs at the prelimit approximate those at the limit. Such an adaptation is the concern of the next lemma. In this lemma, we will relate perturbations on the transition kernel of a single particle to perturbations of the increments in the underlying random walk, and vice versa. In this context, a controlled increment will mean a distribution on increments that can depend in a Borel measurable way on the entire past of the random walk. Lemma 5.7. Let αm j (·|x) be the rescaled single particle transition kernel defined in 76 (3.6), and let x ∈ [0, 1). 1. Given a sequence of controlled increments {ν m j,i , i = 0, 1, . . .}, define the process m m Z¯j,i m by Z¯j,0 m = j + x and Z¯j,i+1 m = Z¯j,i m + ¯θj,i , where ¯θj,i has the conditional distribution ν m ¯m j,i . Let σ ¯ m / (−1, j + 1)} and τ m j = min{i ≥ 0 : Zj,i ∈ j be the distribution of Z¯j,¯ m σmj . Then   ¯m σ j −1 X  ¯  m m E R(ν m j,i kµ) ≥ R τ j αj . i=0 2. 
Conversely, given a controlled transition kernel τ (·|x) that satisfies R(τ kαm j ) < ∞ and dτ (·|x)/dαm j (·|x) ≥ c for some c > 0, one can construct controlled increments {ν m ¯m ¯ mσm has distribution τ , and j,i } and a process {Zj,i }, such that Zj,¯ j  m  ¯ j −1 σ m X R τ αj = E ¯ R(ν m  j,i kµ ) . i=0 Remark 5.8. The conclusion of the previous lemma is also valid for the limit transi- tion kernel γ (·|x, t). For instance, for each t ∈ [0, 1] and given a controlled transition kernel τ , there exists a controlled process Z¯i+1 (t) = Z¯i (t) + ¯θi (t) where the ¯θi (t) are ¯ (t) = min{i ≥ 0 : Z¯i (t) ≥ 1}, iid with distribution ν i (·|t), and a stopping time σ such that Z¯σ¯ (t) (t) has distribution τ and   σ ¯ (t)−1 X ¯ R (τ kγ (·|x, t)) = E R(ν i (· |t) kµ (· |t )) . i=0 The next result will also be used in the construction of a nearly optimal control. Proposition 5.9. The local rate function GN (t, φ) defined in (4.22), i.e. Z Z  GN (t, φ) = inf R (ξ(·|x) kΛ(· |x,t)) ξ 1 (dx) + R (η(·|n) kΘ(· |n)) η 1 (dn) , (ξ,η)∈A(φ) 77 is convex in the second variable. 5.4 Proof of the Upper Bound (2.3) We now proceed to the proof of the large deviation lower bound (2.3):  ˆm ≤ lim sup W inf IN (`) + F N , ` N m→∞ `∈P({1,...,N }) Z 1  N = inf R inf GN (t, φ (t)) dt + F , ` . `∈P({1,...,N }) φ:`= [0,1] φ(t)dt 0 Proof of (2.3). The infimum in (2.3) is over all measurable φ. Since such φ are not regular enough for the construction of a nearly optimal control, we first show how one can, without loss, assume greater regularity. To simplify notation, we omit the N dependence from IN and GN . Step 1: Existence of a near infimizer, `∗ and φ∗ (t), to the right hand side of (2.3). Given ε > 0, choose `∗ ∈ P ({1, . . . , N }) such that  I (`∗ ) + F N , `∗ ≤ inf I (`) + F N , ` + ε < ∞. `∈P({1,...,N }) R1 For such `∗ and ε there is φ∗ (t), t ∈ [0, 1] with `∗ = 0 φ∗ (s) ds satisfying Z 1 G (t, φ∗ (t)) dt ≤ I (`∗ ) + ε. 
0 Step 2: Discretization of φ∗ (t) in time, φκ (t), without increasing the cumulative rate. Rt Now let L (t) = 0 φ∗ (s) ds, so that L (1) = `∗ . For κ ∈ N and t ∈ [0, 1] define a piecewise linear approximation to L (t) via Z (i+1)/κ φκ (t) = κ φ∗ (s) ds if t ∈ [i/κ, (i + 1) /κ) , i ∈ {0, 1, . . . , κ − 1} i/κ 78 and Z t Lκ (t) = φκ (s) ds. 0 R1 R1 Note that Lκ (1) = 0 φκ (s) ds = 0 φ∗ (s) ds = `∗ . The discrete form of φκ (t) will be used, and we denote φκ (i/κ) = φκi . The convexity of G in the second variable proved in Proposition 5.9 yields Z Z (i+1)/κ ! 1   1 X κ−1 G t, L˙ κ (t) dt = G t, κ φ∗ (s) ds dt 0 κ i=0 i/κ X κ−1 Z (i+1)/κ ≤ G (t, φ∗ (t)) dt i=0 i/κ Z 1 = G (t, φ∗ (t)) dt. 0 For the given ε > 0 and κ ∈ N, it follows that Z   1  G t, L˙ κ (t) dt + F N , Lκ (1) ≤ inf I (`) + F N , ` + 2ε. (5.13) 0 `∈P({1,...,N }) R1 Step 3: Discrete sum approximation to the cumulative rate 0 G(t, L˙ κ (t))dt. Con- κ struction of a sequence of near infimum ergodic admissible controls (¯ξ i , η¯κi+1 ). For convenience in the presentation, we assume that the infimum in the definition of G is achieved. If not true, then we could alternatively work with a quantity that is within ε of the infimum as in step 1. Denote the infimizer in (4.22) at (i/κ, φκi ) by (ξ κi , η κi+1 ). Under Condition 1.4 the measures Λ(· |x, t) for different values of t are mutually absolutely continuous with uniformly bounded and continuous Radon- Nikodym derivatives. Thus for all t ∈ [i/κ, (i + 1)/κ) ˙ κ G(t, Lκ (t)) − G (i/κ, φi ) Z ≤ [R (ξ i (· |x) kΛ(· |x, t)) − R (ξ i (· |x) kΛ(· |x, i/κ))] ξ i,1 (dx) κ κ κ < εκ , 79 R1 where εκ → 0 as κ → ∞, and so we can approximate 0 G(t, L˙ κ (t))dt by the sum 1 Pκ−1 κ κ i=0 G (i/κ, φi ). Choose κ large enough that εκ < ε. 
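The mechanism behind Steps 2 and 3 can be checked on a toy example: replacing a density by its interval averages preserves the total mass exactly, and for a convex local rate the averaged cost never exceeds the original one (Jensen's inequality, as in Proposition 5.9). The functions phi and G below are invented stand-ins, not the thesis's objects.

```python
# Illustrative check (invented phi and G, not from the thesis): interval
# averaging preserves mass, and convexity of G means averaging cannot
# increase the cumulative cost -- the content of Step 2.

def interval_averages(phi, kappa, grid=2000):
    """Average phi over each interval [i/kappa, (i+1)/kappa) by a midpoint rule."""
    avgs = []
    for i in range(kappa):
        xs = [(i + (t + 0.5) / grid) / kappa for t in range(grid)]
        avgs.append(sum(phi(x) for x in xs) / grid)
    return avgs

phi = lambda t: t * t          # stand-in for phi*(t)
G = lambda v: v * v            # stand-in for a convex local rate G(t, .)

kappa = 50
avgs = interval_averages(phi, kappa)

total_mass = sum(avgs) / kappa                 # corresponds to L^kappa(1)
exact_mass = 1.0 / 3.0                         # integral of t^2 on [0, 1]
discr_cost = sum(G(a) for a in avgs) / kappa   # cost of the averaged control
exact_cost = 1.0 / 5.0                         # integral of G(phi*(t)) = t^4

assert abs(total_mass - exact_mass) < 1e-4     # mass preserved: L^kappa(1) = l*
assert discr_cost <= exact_cost + 1e-9         # Jensen: averaging cannot raise cost
```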
Furthermore, given any δ > 0, Lemma 5.6 implies that for each (ξ κi , η κi+1 ) ∈ A (φκi ) κ   κ there exist ¯ξ , η¯κ ¯ κ with ¯ ∈A φ φ − φκ ≤ δ, and such that the associated i i+1 i i i v Markov process is geometrically ergodic with bounded local transportation cost and Z Z κ  κ  κ R ¯ξ i (· |x) kΛ (· |x, i/κ) ¯ξ i,1 (dx) + ¯ (dn) R η¯κi+1 (· |n ) kΘ (· |n) φi ≤ G (i/κ, φκi ) + δ. . Pκ−1 ¯ κ ∗ If `¯ = 1 φ , then ` − ¯ ≤ δ. Note that by construction, `(0) ` ¯ = 0. Thus, κ i=0 i v we can choose δ ∈ (0, ε) small enough and use the boundedness of F N to obtain N N ∗ F , `¯ ≤ F , ` + ε. Combining the above inequalities with (5.13), we have 1X κ−1   ¯ κ + F N , `¯ ≤ G i/κ, φ inf I (`) + F N , ` + 5ε. (5.14) i κ i=0 `∈P({1,...,N }) κ Step 4: Reconstruction of controlled random walks from (¯ξ i , η¯κi+1 ). κ For each i ∈ {0, 1, . . . , κ − 1}, ¯ξ i (· |x) is a probability measure on the location of N particles after transportation. Writing this as a product of conditional distri- butions, we have the decomposition Y N ¯ξ κ (dy1 , . . . , dyN |x) = τ κi,n (dyn |x, y1 , . . . , yn−1 ) . i n=1 For notational convenience we write τ κi,n (dyn |x, y1 , . . . , yn−1 ) = τ κi,n (dyn |x), since the dependence on the previously generated particles is not particularly important. 80 By construction of the ergodic control in (5.9), ¯ξ κ = (1 − a) ξ κ + a˜ ˜ π ⊗ Λ, i i where a is chosen according to the parameter δ described in Step 3. Following a similar discussion to that in Part 2 of Lemma 5.6, we obtain the inequality ¯ξ κ (B |x) ≥ a ˜ i Λ (B |x, i/κ) , fiκ (x) κ where fiκ (x) = (d¯ξ i,1 /d˜ π )(x), B is any Borel subset of T2 , and π ˜ is the invariant ˜ (· |·, i/κ) and η κi+1 . Let γ κi (·|·) = γ(·|·, i/κ). By the distribution associated with Λ ˜ we then have dτ κi,n /dγ κi > 0 for all i, n. Hence, Remark 5.8 applies. 
definition of Λ, That is, for each single particle transition kernel τ κi,n , one can construct a sequence of conditional distributions {ν κi,n,k } ⊂ P(R) associated with random walks Z¯i,n,k+1 κ = κ Z¯i,n,k κ + ¯θi,n,k , Z¯i,n,0 κ = Z¯i−1,n,¯ κ σκi−1,n ¯ κi,n = min{k ≥ 0 : Z¯i,n,k and stopping times σ κ ≥ 1} satisfying Z¯i,n,¯ κ σκi,n ∼ τ κi,n (·), such that the relative entropy cost is preserved, i.e.,   ¯κ σ i,n −1  X  ¯ R τ κi,n (· |x) kγ (· |xn , i/κ) = E R ν κi,n,k (·) kµ (· |i/κ )  . (5.15) k=0 The random walks are constructed sequentially with the increment distributions   ¯ ¯θκ ∈ · x, Z¯ κ , . . . , Z¯ κ P = ν κi,n,k · x, Z¯i,n,0 κ , . . . , Z¯i,n,k κ . i,n,k i,n,0 i,n,k Again for simplicity we write ν κi,n,k (·|x, Z¯i,n,0 κ , . . . , Z¯i,n,k κ ) = ν κi,n,k (·|x). Step 5: Introduction of large deviation parameter and construction of the prelimit m controls (ˆξ j , ηˆm j ) for the final proof of (2.3). Now we introduce a large deviation parameter m ∈ N. We first elaborate on m the construction of transportation controls which are later denoted as ˆξ j . For n ∈ {1, . . . , N }, i ∈ {0, 1, . . . , κ − 1} and j/m ∈ [i/κ, (i + 1) /κ) define the controlled 81 random walk m Zˆj,n,k+1 m = Zˆj,n,k m + ˆθj,n,k , Zˆj,n,0 m = Zˆj−1,n,ˆ m σmj−1,n with increment distributions  n m o  ν κ (· |x) k≤σ ˆm ˆ ˆθ ˆm ˆ m i,n,k j,n (b) P j,n,k ∈ · x, Zj,n,0 , . . . , Zj,n,k = ,  µ(·|Zˆ m /m + j/m) otherwise j,n,k ˆm ˆm ˆm where σ j,n (b) = min{k ≥ 0 : Zj,n,k ≤ j(b − 1) or Zj,n,k ≥ 1} with b ∈ (0, 1). This means that we give up trying to control the process if it is sufficiently negative, κ and so the control used differs slightly from the control based completely on ¯ξ i , but m as we will see momentarily this is needed to control the cost. We let ˆξ j denote the corresponding transportation kernel over N particles. Also, the corresponding resampling controls are defined as ηˆm κ j (·|n) = η i+1 (·|n) for j/m ∈ [i/κ, (i + 1) /κ). 
As noted in (3.12), given $b < 1$ there exists $c_b < 1$ such that for $b^2 t \le s \le t/b^2$
\[
c_b\, \mu(dy \,|\, t) \le \mu(dy \,|\, s) \le \frac{1}{c_b}\, \mu(dy \,|\, t).
\]
Thus, for $m$ big enough that $(jb/m, (j+1)/m) \subset (ib^2/\kappa, i/(b^2\kappa))$ and $k \le \hat\sigma^m_{j,n}(b)$, we have the bound
\[
R\Bigl(\nu^\kappa_{i,n,k}(\cdot \,|\, x)\, \Big\|\, \mu\bigl(\cdot \,\big|\, \hat Z^m_{j,n,k}/m + j/m\bigr)\Bigr) \le R\bigl(\nu^\kappa_{i,n,k}(\cdot \,|\, x)\, \big\|\, \mu(\cdot \,|\, i/\kappa)\bigr) - \log c_b. \tag{5.16}
\]
Hence one can bound the prelimit transportation cost according to
\[
\bar E R\bigl(\hat\xi^m_j(\cdot \,|\, x)\, \big\|\, \Lambda^m_j(\cdot \,|\, x)\bigr) \le \bar E \sum_{n=1}^{N} \sum_{k=0}^{\hat\sigma^m_{j,n}(b)-1} R\Bigl(\nu^\kappa_{i,n,k}(\cdot \,|\, x)\, \Big\|\, \mu\bigl(\cdot \,\big|\, \hat Z^m_{j,n,k}/m + j/m\bigr)\Bigr)
\]
\[
\le \bar E \sum_{n=1}^{N} \sum_{k=0}^{\hat\sigma^m_{j,n}(b)-1} R\bigl(\nu^\kappa_{i,n,k}(\cdot \,|\, x)\, \big\|\, \mu(\cdot \,|\, i/\kappa)\bigr) - N \bar E \hat\sigma^m_{j,n}(b) \log c_b
\]
\[
\le \bar E \sum_{n=1}^{N} \sum_{k=0}^{\hat\sigma^m_{j,n}(0)-1} R\bigl(\nu^\kappa_{i,n,k}(\cdot \,|\, x)\, \big\|\, \mu(\cdot \,|\, i/\kappa)\bigr) + \varepsilon^m_b = \bar E R\bigl(\bar\xi^\kappa_i(\cdot \,|\, x)\, \big\|\, \Lambda(\cdot \,|\, x, i/\kappa)\bigr) + \varepsilon^m_b, \tag{5.17}
\]
where $\varepsilon^m_b \to 0$ as $m \to \infty$. The first inequality is an application of the chain rule (Theorem 0.8) and Lemma 5.7(a). Note that the controlled increments after time $\hat\sigma^m_{j,n}(b)$ do not contribute to the relative entropy cost, since, by construction, the control then follows the law of the underlying random walk. The second inequality is obtained from (5.16). The third inequality is true because $\sup \bar E \hat\sigma^m_{j,n}(b) < \infty$ is guaranteed by the finiteness of the relative entropy. The last equality is due to (5.15) and the fact that $\hat\sigma^m_{j,n}(0)$ corresponds to $\bar\sigma^\kappa_{i,n}$.

It is useful to review where we are in the construction at this stage. Starting with a control that is nearly optimal for the limit, we have constructed a control that can be applied to the increments of individual particles at the prelimit level, and which leads to nearly the same cost. However, to do so we have had to make a small change in the distribution of the processes at the threshold crossings. Specifically, in order to control the costs it was necessary to suspend active control of particles once they had moved far enough to the left, switch back to the underlying distribution, and rely on the law of large numbers for the underlying distribution to finish the construction.
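The cutoff construction just described can be sketched numerically. Everything below is an invented stand-in: Gaussian increments replace the abstract law $\mu$, and an exponentially tilted mean replaces the control $\nu$; only the mechanism (accrue relative-entropy cost while controlling, give up below a floor) mirrors the text.

```python
# Hypothetical illustration of the Step 5 cutoff: control a random walk with
# a tilted increment law until it crosses 1 (success) or falls below a floor,
# after which it reverts to the base law and accrues no further entropy cost.
import math, random

random.seed(0)

MU, SIGMA = -0.2, 1.0   # base increment law N(MU, SIGMA^2), stand-in for mu
THETA = 0.6             # exponential tilt; controlled mean is MU + THETA*SIGMA^2

def kl_tilted_vs_base():
    # R(N(MU + THETA*SIGMA^2, SIGMA^2) || N(MU, SIGMA^2)) = THETA^2 * SIGMA^2 / 2
    return THETA ** 2 * SIGMA ** 2 / 2.0

def run_walk(floor=-5.0, max_steps=10_000):
    z, cost, controlling = 0.0, 0.0, True
    for _ in range(max_steps):
        if z >= 1.0:
            return z, cost, True      # threshold crossed
        if z <= floor:
            controlling = False       # give up: revert to the base law
        mean = MU + (THETA * SIGMA ** 2 if controlling else 0.0)
        z += random.gauss(mean, SIGMA)
        if controlling:
            cost += kl_tilted_vs_base()  # cost accrues only while controlled
    return z, cost, False

hits = sum(run_walk()[2] for _ in range(200))  # most controlled walks succeed
```

With the tilted drift $0.4 > 0$ nearly every walk crosses the threshold, while the uncontrolled base drift $-0.2$ would make that a rare event; the accrued cost is the walk's relative-entropy budget.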
This led to a small change in the distribution of the transition kernel whose effect will vanish as $m \to \infty$. However, it means that we cannot directly apply the ergodic theorem as would traditionally be done at this point in a large deviations analysis, since we no longer have a stationary ergodic process, but only one whose nonstationary transition kernels are in some sense uniformly close to those of a geometrically ergodic Markov process. To prove the convergence of the controlled distributions on the number of successful particles, we will need the following result, whose proof is presented at the end of this chapter.

Proposition 5.10. Consider $j \in [i/\kappa, (i+1)/\kappa)$. Let $\hat p^m_j(x, dy)$ and $\bar p^\kappa_i(x, dy)$ be the transition kernels associated with $(\hat\xi^m_j, \bar\eta^\kappa_{i+1})$ and $(\bar\xi^\kappa_i, \bar\eta^\kappa_{i+1})$ respectively. Denote the empirical measure of $\{\hat X^m_j, i/\kappa \le j/m \le (i+1)/\kappa\}$ over this interval by $\hat J^m_i$ and the invariant distribution of $\bar p^\kappa_i$ by $\bar\lambda^\kappa_i$. Then
\[
\bigl\|\hat J^m_i - \bar\lambda^\kappa_i\bigr\|_v \to 0
\]
in probability as $m \to \infty$.

Step 6: Completion of the proof of (2.3).

With Proposition 5.10 in hand, we are now ready to complete the proof of (2.3). The starting point is the representation (3.8). It is easy to check that Steps 1-5 provide an admissible control sequence, and hence give an upper bound on $\hat W^m_N$. Using such controls we define the controlled sequences $(\hat X^m_j, \hat Y^m_j)$. For $i \in \{0, \ldots, \kappa - 1\}$, let
\[
m_i = \sum_{j=0}^{m-1} 1_{[i/\kappa, (i+1)/\kappa)}(j/m)
\]
be the number of stages within the continuous time interval $[i/\kappa, (i+1)/\kappa)$, and
\[
i_j = \sum_{k=0}^{i-1} m_k + j
\]
be the index of the $j$th element within the interval $[i/\kappa, (i+1)/\kappa)$.
With these 84 definitions, we can write ( ) 1 X h ˆm  ˆ m        i m−1 ˆ m ˆm m ˆm ˆm E R ξ j · Xj Λj · Xj + R ηˆj+1 · Nj+1 Θ · Nj+1 m j=0 ( X κ−1 mi ˆ 1 X mi −1 h      m ˆm m ˆm = E R ˆξ ij · X ij Λ ij · X ij i=0 m mi j=0 )     i κ ˆm ˆm + R η¯ · N i+1 Θ · N ij+1 ij+1 ( X κ−1 mi 1 X mi −1 h      ˆ ¯κ ˆm ˆm ≤ E R ξ i · Xij Λ · X ij , i/κ i=0 m mi j=0 )     i ˆm ˆm + R η¯κi+1 · Nij+1 Θ · N ij+1 + εm . b The equality is just due to definitions, while the inequality follows from (5.17). Since κ is fixed, mi /m → 1/κ as m → ∞, and it will suffice to obtain upper bounds on each term of the form ( 1 X mi −1 h      ˆ ¯ κ ˆm ˆm lim sup E R ξ i · Xij Λ · Xij , i/κ mi →∞ mi j=0 )     i ˆm ˆm + R η¯κi+1 · N ij+1 Θ · N ij+1 . (5.18) ˆm For each i, the chain X ˆm ij is obtained from the transition kernel ξ ij , which is uniformly κ close to ¯ξ i . It follows from part 3 of Lemma 5.6 and Proposition 5.10 that the limit R κ κ superior of the first term is bounded above by R(¯ξ i (·|x)kΛ(·|x, i/κ))¯ξ i,1 (dx) + ε. A similar but simpler argument can be applied to the second term to obtain the combined upper bound Z Z κ  κ  κ R ¯ξ i (· |x) kΛ (· |x, i/κ) ¯ξ i,1 (dx) + ¯ (dn) + 2ε R η¯κi+1 (· |n) kΘ (· |n) φ i κ ¯κ. for (5.18), where ¯ξ i,1 = λi 85 To complete the argument, we will have to compute an upper bound for EhFˆ N, L ˆ m i, ˆ m = 1/m Pm δ ˆ m . Since the transportation measures ˆξ m and ¯ξ κ are close where L j=1 N j j i in total variation norm for j ∈ [i/κ, (i + 1)/κ) by construction, together with Propo- ˆm sition 5.10 and the definitions of L ¯ j and ` for sufficiently large m D E ˆ m ≤ F N , `¯ + ε. ˆ FN, L E We now combine the last four displays and (5.14) to obtain ˆm lim sup W N m→∞ Z Z 1 κ  κ  κ ≤ ¯ ¯ ¯ (dn) R ξ i (· |x) kΛ (· |x, i/κ) ξ i,1 (dx) + R η¯κi+1 (· |n ) kΘ (· |n) φi κ N + F , `¯ + 3ε + lim sup lim sup εm b b→1 m→∞ 1 X κ−1  ≤ ¯ κ + F N , `¯ + 4ε G i/κ, φ i κ i=0  ≤ inf I (`) + F N , ` + 9ε. 
`∈P({1,...,N }) Since ε is arbitrary, we have proved the upper bound (2.3). 5.5 Appendix We finish this chapter by giving the proofs of Lemma 5.7, Proposition 5.9, Proposition 5.10 and Lemma 5.6(3). Lemma 5.7. Let αm j (·|x) be the rescaled single particle transition kernel defined in (3.6), and let x ∈ [0, 1). 1. Given a sequence of controlled increments {ν m j,i , i = 0, 1, . . .}, define the process m m Z¯j,i m by Z¯j,0 m = j + x and Z¯j,i+1 m = Z¯j,i m + ¯θj,i , where ¯θj,i has the conditional distribution ν m ¯m ¯ m / (−1, j + 1)} and τ m be the j,i . Let σ j = min{i ≥ 0 : Zj,i ∈ j 86 distribution of Z¯j,¯ m σmj . Then   ¯m σ j −1 X  ¯  m m E R(ν m j,i kµ) ≥ R τ j αj . i=0 2. Conversely, given a controlled transition kernel τ (·|x) that satisfies R(τ kαm j ) < ∞ and dτ (·|x)/dαm j (·|x) ≥ c for some c > 0, one can construct controlled increments {ν m ¯m ¯ m m has distribution τ , and j,i } and a process {Zj,i }, such that Zj,¯ σj  m  ¯ j −1 σ m X R τ αj = E ¯ R(ν m  j,i kµ ) . i=0 Proof. Part 1. For fixed T < ∞, consider a modification of ν m ¯m j,i , such that if σ j < T, then we redefine ν m ¯m j,i = µ for σ j < i ≤ T − 1. With this modified definition of the control, which does not change the distribution of the hitting time on the set σm {¯ j < T }, we have     ¯m σ j −1 σm (¯ j ∧T )−1 X X ¯ R(ν m  ≥ E ¯ R(ν m  E j,i kµ) j,i kµ) (5.19) i=0 i=0   = R QT {0,...,T −1} P|{0,...,T −1} , where QT denotes the modified joint measure on the space of increments, QT |{0,...,T −1} denotes the restriction to the first T coordinates in the underlying product space, and P|{0,...,T −1} denotes product measure with marginal µ. We now consider the disjoint, finite partition of the space RT = ATj ∪ BjT ∪ DjT , 87 where n o ATj = ω : Z¯j,σ m m ≥ j + 1, σ j ¯ m j < T , n o BjT = ω : Z¯j,σm j ¯m m ≤ −1, σ j < T ,  DjT = ω:σ ¯mj ≥ T . 
Using the approximation property of relative entropy via sums over finite measurable partitions, Lemma 0.7(g), we obtain
\[
R\Bigl(Q^T\big|_{\{0,\ldots,T-1\}}\, \Big\|\, P\big|_{\{0,\ldots,T-1\}}\Bigr) \ge R\bigl(\tau^{m,T}_j\, \big\|\, \alpha^{m,T}_j\bigr), \tag{5.20}
\]
where $\tau^{m,T}_j, \alpha^{m,T}_j \in \mathcal{P}(\{(-\infty,-1] \cup [j+1,\infty), \Sigma\})$ are the measures induced by $Q^T|_{\{0,\ldots,T-1\}}$ and $P|_{\{0,\ldots,T-1\}}$, and $\Sigma$ corresponds to the event $\bar\sigma^m_j \ge T$.

We know that $\alpha^{m,T}_j(\Sigma) \to 0$ as $T \to \infty$. Let $\bar\alpha^m_j$ denote the extension of $\alpha^m_j$ to $\{(-\infty,-1] \cup [j+1,\infty), \Sigma\}$ with $\bar\alpha^m_j(\Sigma) = 0$. Let $\bar\tau^m_j$ denote the limit of $\tau^{m,T}_j$, which must exist by monotonicity. By the lower semicontinuity of relative entropy,
\[
\liminf_{T \to \infty} R\bigl(\tau^{m,T}_j\, \big\|\, \alpha^{m,T}_j\bigr) \ge R\bigl(\bar\tau^m_j\, \big\|\, \bar\alpha^m_j\bigr). \tag{5.21}
\]
If $\bar\tau^m_j(\Sigma) > 0$ then $R(\bar\tau^m_j \| \bar\alpha^m_j) = \infty$, and if $\bar\tau^m_j(\Sigma) = 0$ then $R(\bar\tau^m_j \| \bar\alpha^m_j) = R(\tau^m_j \| \alpha^m_j)$. The last sentence and (5.19), (5.20) and (5.21) imply
\[
\bar E \left[ \sum_{i=0}^{\bar\sigma^m_j - 1} R(\nu^m_{j,i} \| \mu) \right] \ge R\bigl(\tau^m_j\, \big\|\, \alpha^m_j\bigr).
\]

Part 2. Define
\[
U(x; y) = -\log \frac{d\tau(\cdot \,|\, x)}{d\alpha^m_j(\cdot \,|\, x)}(y).
\]
Since by assumption $U(x; y)$ is bounded from above, according to [9, Proposition 4.5.1]
\[
-\log \int e^{-U(x;y)}\, \alpha^m_j(dy \,|\, x) = \inf_{\beta : R(\beta \| \alpha^m_j) < \infty} \left[ R\bigl(\beta\, \big\|\, \alpha^m_j\bigr) + \int U(x; y)\, \beta(dy) \right].
\]
Since $R(\tau \| \alpha^m_j) < \infty$, by the definition of $U$,
\[
\tau(dy \,|\, x) = e^{-U(x;y)}\, \alpha^m_j(dy \,|\, x) \tag{5.22}
\]
achieves the infimum in the variational formula, i.e.,
\[
-\log \int e^{-U(x;y)}\, \alpha^m_j(dy \,|\, x) = R\bigl(\tau\, \big\|\, \alpha^m_j\bigr) + \int U(x; y)\, \tau(dy \,|\, x).
\]
We next consider the analogous relations on the space of increments of the process. We have
\[
-\log \int e^{-U\bigl(x;\, \bar Z^m_{j,\bar\sigma^m_j}(\omega)\bigr)}\, P(d\omega) = \inf_{Q} \left[ R(Q \| P) + \int U\bigl(x;\, \bar Z^m_{j,\bar\sigma^m_j}(\omega)\bigr)\, Q(d\omega) \right],
\]
where $\bar\sigma^m_j$ denotes the first hitting time of the process and $P$ is the underlying product probability with marginal $\mu$. The definition of $\alpha^m_j$ implies that for any integrable function $K$
\[
\int K\bigl(\bar Z^m_{j,\bar\sigma^m_j}(\omega)\bigr)\, e^{-U\bigl(x;\, \bar Z^m_{j,\bar\sigma^m_j}(\omega)\bigr)}\, P(d\omega) = \int K(y)\, e^{-U(x;y)}\, \alpha^m_j(dy \,|\, x).
\]
Also, the minimum is achieved at
\[
Q(d\omega) = e^{-U\bigl(x;\, \bar Z^m_{j,\bar\sigma^m_j}(\omega)\bigr)}\, P(d\omega).
\]
(5.23) 89 Since the first hitting probability αm j is induced by P, (5.22) and (5.23) imply Z   Z   −U  ¯ m m (ω)  x;Z U x; Z¯j,¯ m σmj (ω) Q (dω) = U x; Z¯j,¯ m σjm (ω) e j,¯ σ j P (dω) Z = U (x; y) e−U (x;y) αm j (dy|x) Z = U (x; y) τ (dy|x) . Hence, by the chain rule "∞ #  m  ¯ j −1 σ m X X R τ αj = R (Q kP) = EQ R(ν m j,i kµ) ≥ E Q R(ν m  j,i kµ ) , i=0 i=0 where {ν m j,i } are the controls defined by factoring Q. Combining this with the bound proved in part 1, we have constructed controls and a controlled process {Z¯j,i m } satisfying   ¯m σ j −1 X m EQ  R(ν m  j,i kµ) = R τ αj , i=0 where τ is the distribution induced by the stopped, controlled process. Although here we have constructed the control on the canonical space R∞ , it can be applied in the general setting by making the obvious identifications. Proposition 5.9. The local rate function GN (t, φ) defined in (4.22) is convex in the second variable. Proof. By (4.21), Z dξ (· |x) R (ξ kξ 1 ⊗ Λ) = ξ (dx, dy) η (dσ |gnum (y)) log (y), dΛ (· |x, t) Z dη (· |gnum (y)) R (η kη 1 ⊗ Θ) = ξ (dx, dy) η (dσ |gnum (y)) log (σ). dΘ (· |gnum (y)) 90 Summing these two equalities, we have GN (t, φ) Z dξ (· |x) dη (· |gnum (y)) = inf ξ (dx, dy) η (dσ |gnum (y)) log (y) (σ). (ξ,η)∈A(φ) dΛ (· |x, t) dΘ (· |gnum (y)) Let Ψ (dy, dσ |x, t) = Θ (dσ |gnum (y)) Λ (dy |x, t) , and let p be the transition kernel associated with (Λ, Θ). Denote the corresponding ¯ p¯. Then the local rate function G can be written as controlled versions by Ψ, GN (t, φ)  Z Z     = inf inf ¯ (· |x) kΨ (· |x, t) : φ (·) = ξ 1 (dx) R Ψ ¯ ξ 1 ⊗ Ψ 2 (dy) . ξ 1 =ξ 1 p¯ {gnum (y)=·}  ¯ i , i = 1, 2 be within ε > 0 of the infimum. For φ1 , φ2 ∈ P({1, . . . , N }), let ξ i1 , Ψ ¯ 1 +(1 − β) ξ 2 ⊗ Ψ For β ∈ (0, 1), the convex combination βξ 11 ⊗ Ψ ¯ 1 can be decomposed 1 ¯ where (ξ 1 , Ψ) into ξ 1 ⊗ Ψ, ¯ are defined on the same space as (ξ i , Ψ ¯ i ). 
Then ξ 1 = 1 βξ 11 + (1 − β) ξ 21 and Z   ¯ (dy) = βφ1 ({n}) + (1 − β) φ2 ({n}) , 1{gnum (y)=n} ξ 1 ⊗ Ψ 2 Furthermore, Z ¯ (dy, dσ |x) ξ 1 (dx) δ {h(g(y),σ)} (·) Ψ Z = β δ {h(g(y),σ)} (·) Ψ ¯ 1 (dy, dσ |x) ξ 11 (dx) Z + (1 − β) δ {h(g(y),σ)} (·) Ψ ¯ 2 (dy, dσ |x) ξ 2 (dx) 1 = βξ 11 (dx) + (1 − β) ξ 21 (dx) = ξ 1 (dx) . 91 The convexity of relative entropy then implies βGN (t, φ1 ) + (1 − β) GN (t, φ2 ) + 2ε   ≥ βR ξ 11 ⊗ Ψ ¯ 1 ξ 1 ⊗ Ψ + (1 − β) R ξ 2 ⊗ Ψ ¯ 1 ξ 2 ⊗ Ψ 1 1 1   ≥ R βξ 11 ⊗ Ψ ¯ 1 βξ 11 + (1 − β) ξ 21 ⊗ Ψ ¯ 1 + (1 − β) ξ 21 ⊗ Ψ  = R ξ1 ⊗ Ψ ¯ kξ 1 ⊗ Ψ ≥ GN (t, βφ1 + (1 − β) φ2 ) . Since ε > 0 is arbitrary, convexity follows. Proposition 5.10. Consider j ∈ [i/κ, (i + 1)/κ). Let pˆm ¯κi (x, dy) be j (x, dy) and p m κ the transition kernels associated with (ˆξ j , η¯κi+1 ) and (¯ξ i , η¯κi+1 ) respectively. Denote ˆm the empirical measure of {X ˆm j , i/κ ≤ j/m ≤ (i + 1)/κ} over this interval by Ji and ¯ κ . Then the invariant distribution of p¯κi by λ i ˆm ¯ κ Ji − λi → 0 v in probability as m → ∞. ˆm Proof. Since {X ˆm / j } is tight, given η > 0, there is compact S1 ⊂ T1 such that P{Xj ∈ S1 } ≤ η for all j ≤ m, including the index that first exceeds i/κ. Using standard results from stochastic stability, there is a compact set S2 ⊂ T1 such that, after starting in S1 , the probability of escaping from S2 is also less than η. To simplify the presentation, we denote the initial value of j ∈ [i/κ, (i + 1)/κ) by zero and the final value by m, and write Jˆm for Jˆim . By the geometric ergodicity of p¯κi and the compactness of S2 κ,(k) ¯ κ p¯i (x, ·) − λi (·) → 0 v 92 κ,(k) uniformly in x ∈ S2 , where p¯i is the k-fold convolution of p¯κi . Let δ > 0, and assume that k is large enough that for such x κ,(k) ¯ κ i p¯ (x, ·) − λ i (·) < δ. (5.24) v Let pˆm ˆm j,` denote convolution of the transition kernels p ˆm jk+` , p ˆm jk+`+1 , . . . p (j+1)k+`−1 , ˆm ˆ m = x then the distribution of X ˆm so that if X kj+` k(j+1)+` is given by p j,` (x, ·). Let ζ > 0 be given. 
For $m$ sufficiently large we know that
\[
\bigl\|\hat p^m_j(x, \cdot) - \bar p^\kappa_i(x, \cdot)\bigr\|_v < \zeta. \tag{5.25}
\]
Iterating (5.25) gives
\[
\bigl\|\hat p^m_{j,\ell}(x, \cdot) - \bar p^{\kappa,(k)}_i(x, \cdot)\bigr\|_v \le k\zeta
\]
for all $\ell$ and all $x \in S_2$, and thus
\[
\bigl\|\hat p^m_{j,\ell}(x, \cdot) - \bar\lambda^\kappa_i(\cdot)\bigr\|_v < k\zeta + \delta \tag{5.26}
\]
for all $\ell$ and all $x \in S_2$.

Let $k \in \mathbb{N}$ be given. We will break the empirical measure $\hat J^m$ into $k$ sums, and it follows from the fact that $k$ will be fixed as $m \to \infty$ that we can consider just those $m$ of the form $rk$, $r \in \mathbb{N}$. We then write
\[
\hat J^m = \frac{1}{m} \sum_{j=0}^{m-1} \delta_{\hat X^m_j} = \frac{1}{k} \sum_{\ell=0}^{k-1} \frac{k}{m} \sum_{j=0}^{[m/k]-1} \delta_{\hat X^m_{kj+\ell}} = \frac{1}{k} \sum_{\ell=0}^{k-1} \hat J^m_\ell,
\]
where the last equality defines the $\hat J^m_\ell$.

Now fix $\ell \in \{0, 1, \ldots, k-1\}$ and consider any bounded measurable function $f : T_1 \to \mathbb{R}$ with $\|f\|_\infty \le 1$. As usual, we have that
\[
f\bigl(\hat X^m_{k(j+1)+\ell}\bigr) - \int f(y)\, \hat p^m_{j,\ell}\bigl(\hat X^m_{jk+\ell}, dy\bigr)
\]
is a martingale difference sequence, and thus by (5.26)
\[
\left[ f\bigl(\hat X^m_{k(j+1)+\ell}\bigr) - \int f(y)\, \bar\lambda^\kappa_i(dy) \right] 1_{\{\hat X^m_{jk+\ell} \in S_2\}} = \varepsilon^m_{1,j} + \varepsilon^m_{2,j},
\]
where $|\varepsilon^m_{1,j}| \le k\zeta + \delta$ and $\varepsilon^m_{2,j}$ has conditional mean zero and uniformly (in $m$ and $j$) bounded second moment. It follows using a standard calculation that
\[
P\left( \bigcup_{\ell=0}^{k-1} \left\{ \left| \int f(x)\, \hat J^m_\ell(dx) - \int f(y)\, \bar\lambda^\kappa_i(dy) \right| \ge 2[k\zeta + \delta] \right\} \right) \le \frac{k^2}{4[k\zeta + \delta]^2 m} + 2\eta.
\]
Letting first $m \to \infty$ shows that weak limits of the $\hat J^m_\ell$ are all within $2[k\zeta + \delta]$ of $\bar\lambda^\kappa_i$ in total variation norm, save on a set of probability no more than $2\eta$. We now send $\zeta \downarrow 0$, then $\delta \downarrow 0$, and finally $\eta \downarrow 0$ to complete the proof.

We are now ready to prove Part 3 of Lemma 5.6.

Proof. Let $h(x, y) = (d\bar\Lambda/d\Lambda)(x, y)$, where the Radon-Nikodym derivative is in $y$. For $k \in \mathbb{N}$ define
\[
S^k(x) = 1 - \int_{T_2} h_k(x, y)\, \Lambda(dy \,|\, x),
\]
where $h_k(x, y) = k \wedge h(x, y)$. Also, let
\[
\bar\Lambda^k(A \,|\, x) = \int_A h_k(x, y)\, \Lambda(dy \,|\, x) + S^k(x)\, \tilde\Lambda(A \,|\, x)
\]
for any Borel set $A$ of $T_2$. Obviously, $(\bar\xi^k, \bar\eta)$ is also admissible and the associated transition kernel is equivalent to $\bar p(x, \cdot)$. Hence this transition kernel is also geometrically ergodic.
Furthermore, the relative entropy
\[
R\big( \bar\Lambda^k(\cdot|x) \,\big\|\, \Lambda(\cdot|x) \big)
= \int_{T_2} \log h^k(x,y)\, \bar\Lambda^k(dy|x) + S^k(x)\, R\big( \tilde\Lambda(\cdot|x) \,\big\|\, \Lambda(\cdot|x) \big)
\]
is uniformly bounded as a function of x. Since \(S^k(x) \to 0\) as \(k \to \infty\), the Dominated Convergence Theorem implies (5.8).

It is easy to check that the construction implies \(\|\bar\Lambda^k(\cdot|x) - \bar\Lambda(\cdot|x)\|_v \to 0\). It then follows from Proposition 5.10 that the first marginals of \(\bar\xi^k\) converge to \(\bar\xi_1\) in total variation, i.e. \(\|\bar\xi_1^k - \bar\xi_1\|_v \to 0\).

Chapter 6
Asymptotics of the Decay Rate as N → ∞

In this chapter, we assume for simplicity that the increments \(\{\theta_i, i \in \mathbb{N}\}\) are iid with \(\theta_i \sim \mu(dy)\), and evaluate the asymptotics of \(I^N\) as \(N \to \infty\). The analogous result holds for the general model used previously, but with more involved constructions and notation. Under this assumption, it follows that the rate function of interest coincides with the local rate function. That is, for \(\mu \in \mathcal{P}(\{1,\ldots,N\})\) the rate function can be written as
\[
I^N(\mu) = \inf_{(\bar\xi, \bar\eta) \in A(\mu)} \left[ R\big(\bar\xi \,\big\|\, \bar\xi_1 \otimes \Lambda\big) + R\big(\bar\eta \,\big\|\, \bar\eta_1 \otimes \Theta\big) \right],
\]
where \(A(\mu) = \{\text{feasible } (\bar\xi, \bar\eta) \text{ with } \bar\eta_1 = \mu\}\). As established in Theorem 2.2, the decay rate for the second moment has the expression
\[
V^N = \inf \left[ R\big(\bar\xi \,\big\|\, \bar\xi_1 \otimes \Lambda\big) + R\big(\bar\eta \,\big\|\, \bar\eta_1 \otimes \Theta\big) - 2\log \int \frac{n}{N}\, \bar\eta_1(dn) \right].
\]
Throughout this chapter we assume that all the conditions of Theorem 2.2 are in force, and adapt the notation to reflect the fact that the local rate function no longer depends on t.

Recalling that \(\gamma(dy|x)\) is the transition kernel for a single particle, it follows from Condition 1.4 and Condition 5.1 that there exists a unique pair of measures \((\pi, \kappa) \in \mathcal{P}([0,\infty)) \times \mathcal{P}(\{\Delta\} \cup [0,\infty))\) satisfying
\[
\kappa = [\pi \otimes \gamma]_2, \quad \text{and} \quad \pi(\cdot) = \frac{\kappa(\cdot)}{1 - \kappa(\Delta)}. \tag{6.1}
\]
In other words, \(\kappa(\Delta)\) is the stationary mass absorbed by \(\Delta\) under \(\gamma\), and if we renormalize the mass that remains then \(\pi\) is the invariant probability for the resulting transition kernel.
It is easy to check that the large deviation rate defined in (1.5) is exactly \(\gamma = -\log(1 - \kappa(\Delta))\), and so the best possible rate of convergence for the Monte Carlo scheme is \(-2\log(1 - \kappa(\Delta))\). By Jensen's inequality \(V^N \le -2\log(1 - \kappa(\Delta))\): since the estimator is unbiased, its second moment is at least \(p_m^2\), so the second moment cannot decay faster than \(2\gamma\). It is not hard to show that the inequality is in general strict. The following result shows that this rate is achieved when the number of particles tends to infinity.

Theorem 6.1. \(V^N \to -2\log(1 - \kappa(\Delta))\) as \(N \to \infty\).

We will study the asymptotics of \(V^N\) using many of the same weak convergence arguments used previously. We first recall the following factorization of the underlying transportation and resampling distributions:
\[
\begin{aligned}
\Lambda(dy|x) &= \otimes_{i=1}^N \gamma(dy_i|x_i)
= \otimes_{i=1}^N \left[ 1_{[0,1)}(x_i)\, \alpha(dy_i|x_i) + 1_{[1,\infty)}(x_i)\, \delta_{x_i}(dy_i) \right], \\
\Theta(d\sigma|n) &= \otimes_{i=1}^n \delta_i(d\sigma_i)\; \otimes_{i=n+1}^N U_n(d\sigma_i).
\end{aligned}
\]
As in Lemma 4.5 we can factor the controls according to
\[
\begin{aligned}
\bar\xi(dx,dy) &= \otimes_{i=1}^N \bar\xi^i(dx_i,dy_i \,|\, x_1,y_1,\ldots,x_{i-1},y_{i-1}) = \otimes_{i=1}^N \bar\xi^i(dx_i,dy_i), \\
\bar\xi_1(dx) &= \otimes_{i=1}^N \bar\xi_1^i(dx_i \,|\, x_1,\ldots,x_{i-1}) = \otimes_{i=1}^N \bar\xi_1^i(dx_i), \\
\bar\eta(d\sigma|n) &= \otimes_{i=n+1}^N \bar\eta^i(d\sigma_i \,|\, n, \sigma_1,\ldots,\sigma_{i-1})\; \otimes_{i=1}^n \delta_i(d\sigma_i)
= \otimes_{i=n+1}^N \bar\eta^i(d\sigma_i|n)\; \otimes_{i=1}^n \delta_i(d\sigma_i),
\end{aligned}
\]
where the right hand side indicates the variables for which dependence is suppressed in the notation. With an abuse of notation we sometimes write \(\bar\eta^i(d\sigma_i|n)\) for \(\delta_i(d\sigma_i)\) even when \(i \le n\). Note that the controls all depend on N, but we also omit this dependence in the notation. The chain rule for relative entropy and Jensen's inequality then yield
\[
R\big(\bar\xi \,\big\|\, \bar\xi_1 \otimes \Lambda\big)
= \bar E\left[ \sum_{i=1}^N R\big( \bar\xi^i(dx,dy) \,\big\|\, \bar\xi_1^i(dx) \otimes \gamma(dy|x) \big) \right]
\ge N\, \bar E\left[ R\left( \frac{1}{N}\sum_{i=1}^N \bar\xi^i(dx,dy) \,\middle\|\, \frac{1}{N}\sum_{i=1}^N \bar\xi_1^i(dx) \otimes \gamma(dy|x) \right) \right],
\]
where \(\bar E\) indicates integration with respect to the suppressed variables. Since the total cost is bounded for all N, we conclude that
\[
R\left( \frac{1}{N}\sum_{i=1}^N \bar\xi^i(dx,dy) \,\middle\|\, \frac{1}{N}\sum_{i=1}^N \bar\xi_1^i(dx) \otimes \gamma(dy|x) \right) \to 0
\]
in probability.
Tightness follows as in Chapter 4, and hence for any subsequence there exists a subsubsequence such that
\[
\frac{1}{N}\sum_{i=1}^N \bar\xi_1^i \to \bar\pi
\]
in probability. Thus
\[
\frac{1}{N}\sum_{i=1}^N \bar\xi^i \to \bar\pi \otimes \gamma. \tag{6.2}
\]

6.1 Proof of Theorem 6.1

For any feasible controls \((\bar\xi, \bar\eta)\), let \(\{(\bar X_i^N, \bar Y_i^N),\ i=1,2,\ldots,N\}\) be processes with the given joint distributions. Define \(\bar F_i^N\) to be the \(\sigma\)-field generated by \(\{\bar X_k^N,\ k=1,\ldots,i-1\}\) and \(\bar G_i^N\) the one generated by \(\{(\bar X_k^N, \bar Y_k^N),\ k=1,\ldots,i-1\}\). Then for any bounded continuous function f,
\[
f(\bar X_i) - \int f(x)\, \bar\xi_1^i(dx), \quad i=1,\ldots,N,
\qquad\text{and}\qquad
f(\bar Y_i) - \int f(y)\, \bar\xi^i(dx,dy), \quad i=1,\ldots,N,
\]
are martingale difference sequences, with respect to \(\bar F_i^N\) and \(\bar G_i^N\), respectively. This implies
\[
\left( \frac{1}{N}\sum_{i=1}^N \delta_{\bar X_i}, \ \frac{1}{N}\sum_{i=1}^N \bar\xi_1^i \right) \to (\bar\pi, \bar\pi),
\qquad
\left( \frac{1}{N}\sum_{i=1}^N \delta_{\bar Y_i}, \ \frac{1}{N}\sum_{i=1}^N [\bar\xi^i]_2 \right) \to (\bar\kappa, \bar\kappa)
\]
in distribution, and by (6.2) \(\bar\kappa\) satisfies
\[
\bar\kappa = [\bar\pi \otimes \gamma]_2. \tag{6.3}
\]
Interpreting \(g_{loc}(y)\) as vector valued with components \(g_{loc,i}(y)\) that identify the locations from which the resampling controls can select, we have
\[
\bar E\left[ f(\bar X_i^N) - \int f\big(g_{loc,\sigma}(\bar Y^N)\big)\, \bar\eta^i\big(d\sigma \,\big|\, g_{num}(\bar Y^N)\big) \,\middle|\, \bar Y^N, \bar X_1, \ldots, \bar X_{i-1} \right] = 0.
\]
This implies the bound
\[
\begin{aligned}
\bar E&\left[ \left( \frac{1}{N}\sum_{i=1}^N f(\bar X_i) - \frac{1}{N}\sum_{i=1}^N \int f\big(g_{loc,\sigma}(\bar Y^N)\big)\, \bar\eta^i\big(d\sigma \,\big|\, g_{num}(\bar Y^N)\big) \right)^2 \right] \\
&= \frac{1}{N^2} \sum_{i=1}^N \bar E\left[ \left( f(\bar X_i) - \int f\big(g_{loc,\sigma}(\bar Y^N)\big)\, \bar\eta^i\big(d\sigma \,\big|\, g_{num}(\bar Y^N)\big) \right)^2 \right]
\le 2\|f\|_\infty^2 / N \to 0.
\end{aligned} \tag{6.4}
\]
If we can show that
\[
\frac{1}{N}\sum_{i=1}^N \int \delta_{g_{loc,\sigma}(\bar Y^N)}(dx)\, \bar\eta^i\big(d\sigma \,\big|\, g_{num}(\bar Y^N)\big) \to \frac{\bar\kappa(dx)}{1 - \bar\kappa(\Delta)} \tag{6.5}
\]
in probability, then (6.4) implies
\[
\frac{1}{N}\sum_{i=1}^N \delta_{\bar X_i} \to \frac{\bar\kappa(dx)}{1 - \bar\kappa(\Delta)} = \bar\pi.
\]
By uniqueness of the invariant distribution under the renormalized transition kernel and (6.1), we conclude \((\bar\pi, \bar\kappa) = (\pi, \kappa)\). Then Theorem 6.1 will follow from the fact that, by Fatou's Lemma,
\[
\liminf_{N\to\infty} V^N
\ge \liminf_{N\to\infty} \left[ R\big(\bar\xi \,\big\|\, \bar\xi_1 \otimes \Lambda\big) + R\big(\bar\eta \,\big\|\, \bar\eta_1 \otimes \Theta\big) - 2\,\bar E \log\left( \frac{1}{N}\sum_{i=1}^N 1_{[0,\infty)}(\bar Y_i^N) \right) \right]
\ge -2\log(1 - \kappa(\Delta)).
\]
Note that in fact we will have shown that along the minimizing sequence
\[
R\big(\bar\xi \,\big\|\, \bar\xi_1 \otimes \Lambda\big) + R\big(\bar\eta \,\big\|\, \bar\eta_1 \otimes \Theta\big) \to 0.
\]
It remains to show that (6.5) is true. This is a consequence of the finiteness of the rate. In particular,
\[
R\big(\bar\eta \,\big\|\, \bar\eta_1 \otimes \Theta\big)
\ge N\, \bar E\left[ \int R\left( \frac{1}{N}\sum_{i=1}^N \bar\eta^i(d\sigma|n) \,\middle\|\, \frac{N-n}{N}\, U_n(d\sigma) + \frac{1}{N}\sum_{i=1}^n \delta_i(d\sigma) \right) \bar\eta_1(dn) \right]
\]
for all N. Therefore
\[
\frac{1}{N}\sum_{i=1}^N \int \delta_{g_{loc,\sigma}(\bar Y^N)}(\cdot)\, \bar\eta^i\big(d\sigma \,\big|\, g_{num}(\bar Y^N)\big)
\]
has the same limit in probability as
\[
\frac{1}{N}\left[ \sum_{i=1}^{g_{num}(\bar Y^N)} \int \delta_{g_{loc,\sigma}(\bar Y^N)}(\cdot)\, \delta_i(d\sigma)
+ \big(N - g_{num}(\bar Y^N)\big) \int \delta_{g_{loc,\sigma}(\bar Y^N)}(\cdot)\, U_{g_{num}(\bar Y^N)}(d\sigma) \right].
\]
This completes the proof of Theorem 6.1.

Part III
Numerical Results

Chapter 7
Computational Results

In this chapter, we present some computational results for the interacting particle scheme in a simple one-dimensional random walk setting.

Problem Setup and Notation

Let \(\{X_j, j = 0, 1, \ldots\}\) be a sequence of i.i.d. random variables taking values in the discrete state space \(\{-1, 1, 2\}\) with probabilities \(\{0.6, 0.3, 0.1\}\), respectively. For each fixed \(n \in \mathbb{N}\), let \(S_n\) denote the sum \(X_0 + X_1 + \cdots + X_{n-1}\). Let \(m \in \mathbb{N}\) be the large deviation parameter, i.e., the number of thresholds. Define the stopping time \(\sigma_m = \min\{n : S_n \ge m \text{ or } S_n \le -1\}\). Then the quantity of interest defined in (1.3) becomes
\[
p_m = P\{ S_{\sigma_m} \ge m \,|\, S_0 = 0 \}.
\]
The importance sampling estimates of \(p_m\) for different values of m are given in Table 7.1, each computed with sample size 100,000,000. For simplicity, we take these approximations as the true values of \(p_m\) and compare the results from the IPS scheme to them. Throughout this chapter, we use N for the number of particles of the IPS estimator and K for the sample size. \(\{\hat p_i, i = 0, \ldots, K-1\}\) represents the collection of K IPS samples, and
\[
\frac{1}{K}\sum_{i=0}^{K-1} \hat p_i
\]
is the IPS estimate of \(p_m\).

  m    p_m             m    p_m
  10   3.709502e-002   60   1.617500e-005
  20   6.997822e-003   70   3.710000e-006
  30   1.503110e-003   80   7.680000e-007
  40   3.320130e-004   90   1.560000e-007
  50   7.368103e-005   100  3.996866e-008

Table 7.1: Estimates of p_m for different m using importance sampling.
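To make the generation of the IPS samples \(\hat p_i\) concrete, here is a minimal multilevel splitting sketch for this random walk, written for illustration only. It is a simplified variant of an interacting particle scheme and is not claimed to reproduce the exact algorithm analyzed in this thesis; the function names (`step`, `ips_estimate`) are ours. At each of the m stages, every particle runs until it reaches the next threshold or is absorbed at −1, the surviving fraction is recorded, and the survivors are resampled uniformly back up to N particles; the product of the surviving fractions is one sample \(\hat p_i\).

```python
import random

def step(s, rng):
    """One increment of the walk: -1, 1, 2 w.p. 0.6, 0.3, 0.1."""
    u = rng.random()
    if u < 0.6:
        return s - 1
    elif u < 0.9:
        return s + 1
    else:
        return s + 2

def ips_estimate(m, n_particles, rng):
    """One IPS sample of p_m = P{walk reaches m before -1}.

    Simplified multilevel splitting: at stage `level`, each particle
    runs until it reaches `level` or dies at -1; survivors are
    resampled uniformly (with replacement) back to n_particles.
    """
    particles = [0] * n_particles
    estimate = 1.0
    for level in range(1, m + 1):
        survivors = []
        for s in particles:
            while -1 < s < level:
                s = step(s, rng)
            if s >= level:
                survivors.append(s)
        estimate *= len(survivors) / n_particles
        if not survivors:
            return 0.0  # all particles died; this sample is 0
        particles = [survivors[rng.randrange(len(survivors))]
                     for _ in range(n_particles)]
    return estimate

if __name__ == "__main__":
    rng = random.Random(2012)
    samples = [ips_estimate(10, 200, rng) for _ in range(20)]
    # rough average; compare with p_10 ≈ 3.71e-2 from Table 7.1
    print(sum(samples) / len(samples))
```

Because the product-of-fractions estimator is unbiased for this kind of branching scheme, averaging even a modest number of such samples lands near the Table 7.1 value for m = 10.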
The Large Deviation Rate of the Second Moment

In Figure 7.1, we compare the large deviation rate of the second moment of the IPS estimators for different m and N. In each case, the sample size K is taken to be 10,000. The large deviation rate of the second moment is approximated by computing
\[
-\log\left[ \frac{1}{K}\sum_{i=0}^{K-1} \hat p_i^2 \right].
\]
The black dashed line represents the asymptotically optimal rate \(-2\log p_m\), which is approximately an affine function of m for large m. Here \(p_m\) is evaluated by importance sampling. The theoretical results from the previous chapters indicate that the IPS scheme is asymptotically suboptimal, and that the large deviation rate of its second moment is a function of N. However, as the number of particles N goes to infinity, the IPS scheme becomes asymptotically optimal. As one can see in Figure 7.1, the large deviation rate of IPS converges to the asymptotically optimal rate as N grows, which supports the theoretical results.

Figure 7.1: Large deviation rate of the second moment of the interacting particle system estimator for different m and N (curves for N = 5, 10, 50, 100, 250, 500, 1000, 10000, together with the asymptotically optimal rate).

Trade-off Between N and K

Another interesting question related to the IPS estimator is the trade-off between the number of particles N and the number of samples K. We illustrate the problem in two respects:

1. The empirical standard error:
\[
\sigma_K^N = \sqrt{ \frac{1}{K-1} \sum_{i=0}^{K-1} (\hat p_i - \hat p)^2 },
\]
where \(\hat p\) is the sample mean of the IPS estimates.

2. The per sample standard error:
\[
\sqrt{ \operatorname{var}\left( \frac{1}{K}\sum_{i=0}^{K-1} \hat p_i \right) } = \sqrt{ \frac{\operatorname{var}(\hat p_i)}{K} } \approx \frac{\sigma_K^N}{\sqrt{K}}.
\]

To begin with, we fix the computational effort. That is, we hold the total number of simulated paths constant, i.e. N × K = constant. The constant is taken to be 1,000,000 in our simulations.
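The diagnostics above are straightforward to compute from a batch of IPS samples. The following small sketch (function names are ours) computes the approximate second-moment rate plotted in Figure 7.1, the empirical standard error \(\sigma_K^N\), and the per sample standard error \(\sigma_K^N/\sqrt{K}\):

```python
import math

def second_moment_rate(samples):
    """-log of the empirical second moment (1/K) sum p_i^2."""
    K = len(samples)
    return -math.log(sum(p * p for p in samples) / K)

def empirical_std_error(samples):
    """sigma_K^N: sample standard deviation of the K IPS samples."""
    K = len(samples)
    mean = sum(samples) / K
    return math.sqrt(sum((p - mean) ** 2 for p in samples) / (K - 1))

def per_sample_std_error(samples):
    """Approximate standard error of the averaged estimate, sigma_K^N / sqrt(K)."""
    return empirical_std_error(samples) / math.sqrt(len(samples))
```

For instance, `empirical_std_error([0.03, 0.04, 0.05])` returns 0.01 (up to floating point), and dividing by √3 gives the corresponding per sample standard error.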
Figures 7.2 and 7.3 plot the decay of the empirical standard error as the number of particles N increases, for m = 10 and m = 50 respectively. Note that the sample size K also varies, since we keep the total computational effort fixed. N ranges from 5 to 100 in increments of size 1.

Figure 7.4 compares the changes in the per sample standard error to the changes in N. The total range of N is [5, 10000]. The top left graph plots \(\sigma_K^N/\sqrt{K}\) for N ∈ {5, 6, …, 100}. As one can see, there is a significant reduction in the per sample standard error for N less than 40. In the top right graph, which covers part of the range of the bottom graph, the number of particles ranges from 5 to 1,000 in increments of size 5; it shows that there is no per sample standard error reduction for N greater than 100. The bottom graph shows the overall trend of \(\sigma_K^N/\sqrt{K}\). In conclusion, the computational results suggest that the number of particles N should be chosen to be around 40 for our problem.

Figure 7.2: Trade-off between N and K with fixed effort, N × K = 1,000,000, m = 10.

General Results

In this section, we present some basic results of the IPS scheme under the i.i.d. random walk setting with m = 10 and m = 50. Figures 7.5 and 7.6 plot the sample error
\[
\frac{1}{K}\sum_{i=0}^{K-1} \hat p_i - p_m
\]
for fixed sample size K = 10,000, where \(p_m\) is the true value obtained from importance sampling. The sample error appears to be asymptotically independent of N for large N, which is what we expected.
For fixed sample size K = 10,000, Figures 7.7 and 7.8 show that the empirical standard error decreases as the number of particles N grows.

Figure 7.3: Trade-off between N and K with fixed effort, N × K = 1,000,000, m = 50.

Figure 7.9 plots the empirical standard error for different sample sizes K while N is fixed. As expected, the estimates are roughly constant.

We predict that the rate of decay of the second moment of the IPS scheme in N is of order 1/N. If true, the statistic
\[
\frac{N}{m} \log \frac{ \frac{1}{K}\sum_{i=0}^{K-1} \hat p_i^2 }{ (p_m)^2 } \tag{7.1}
\]
should be roughly constant in N. However, this statistic requires m to be large compared to N. In Figure 7.10 we cannot see such a trend because m is too small. In Figure 7.11, one can see that for N less than 1,000 the quantity (7.1) is roughly constant. However, since the sample error is roughly constant in K, for fixed K the statistic (7.1) amplifies that error by a factor of N. Hence, when N grows large in Figure 7.11, we see increasing oscillation of (7.1). Proving this decay rate in N theoretically is part of our future work.
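The statistic (7.1) is a one-line computation from the IPS samples; a minimal sketch (the function name is ours):

```python
import math

def decay_statistic(samples, N, m, p_m):
    """Statistic (7.1): (N/m) * log( empirical second moment / p_m^2 ).

    If the second moment of the IPS estimator decays in N at order 1/N,
    this quantity should be roughly constant in N (for m large relative to N).
    """
    second_moment = sum(p * p for p in samples) / len(samples)
    return (N / m) * math.log(second_moment / p_m ** 2)
```

Note that any fixed relative error in the samples enters (7.1) multiplied by N/m, which is consistent with the growing oscillation seen in Figure 7.11 for large N.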
Figure 7.4: Trade-off between N and K with fixed effort (per sample standard error), N × K = 1,000,000, m = 10.

Figure 7.5: Sample error, m = 10, fixed K.

Figure 7.6: Sample error, m = 50, fixed K.

Figure 7.7: Standard error, m = 10, fixed K.

Figure 7.8: Standard error, m = 50, fixed K.

Figure 7.9: Standard error, m = 10, fixed N.

Figure 7.10: Decay rate in N, m = 10, fixed K.

Figure 7.11: Decay rate in N, m = 50, fixed K.