Essays in Econometrics of Heterogeneous Agents

by Yuya Sasaki

B. S., Utah State University, 2002
M. S., Utah State University, 2007
M. A., Brown University, 2008

A Dissertation submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in the Department of Economics at Brown University

PROVIDENCE, RHODE ISLAND
MAY 2012

© Copyright 2012 by Yuya Sasaki

This dissertation by Yuya Sasaki is accepted in its present form by the Department of Economics as satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date    Frank Kleibergen, Advisor

Recommended to the Graduate Council

Date    Eric Renault, Reader
Date    Susanne Schennach, Reader

Approved by the Graduate Council

Date    Peter M. Weber, Dean of the Graduate School

Vitae

The author was born on June 22nd, 1979 in Tokyo, Japan. He received B.S. from Utah State University in 2002, and M.S. from Utah State University in 2007. He entered Brown University in 2007, and received M.A. in 2008 and Ph.D. in 2012.

Acknowledgements

I am indebted to my advisor, Frank Kleibergen, for continuous support and guidance. I also thank Stefan Hoderlein, Blaise Melly, Eric Renault, and Susanne Schennach for constructive advice throughout my dissertation. Ken Chay, Kaivan Munshi, and Anna Aizer helped me to know how to produce econometric methods which may be of practical use. I benefited from discussion with my colleagues, Toru Kitagawa, Zhaoguo Zhan, Alexei Abrahams, Philipp Ketz, Maria Jose Boccardi, Andrew Elzinga, Bruno Gasperini, and Daniela Scida. I am thankful to my family members for their patience and support.

Contents

Vitae
Acknowledgements
List of Tables
List of Figures

Chapter 1. Heterogeneity and Selection in Dynamic Panel Data
  1. Introduction
  2. Background
  3. An Overview
    3.1. A Sketch of the Identification Strategy
    3.2. A Sketch of the Identification Strategy for T = 6
  4. Identification
    4.1. Identifying Restrictions
    4.2. Representation
    4.3. The Main Identification Result
  5. Estimation
    5.1. Constrained Maximum Likelihood
    5.2. An Estimator
    5.3. Monte Carlo Evidence
  6. Empirical Illustration: SES and Mortality
    6.1. Background
    6.2. Empirical Model
    6.3. Data
    6.4. Empirical Results
  7. Summary
  8. Appendix: Proofs for Identification
    8.1. Lemma 1 (Representation)
    8.2. Lemma 2 (Identification)
    8.3. Lemma 3 (Independence)
    8.4. Lemma 4 (Invariant Transition)
  9. Appendix: Proofs for Estimation
    9.1. Corollary 1 (Constrained Maximum Likelihood)
    9.2. Remark 9 (Unit Lagrange Multipliers)
    9.3. Proposition 1 (Consistency of the Nonparametric Estimator)
    9.4. Semiparametric Estimation
  10. Appendix: Special Cases and Generalizations of the Baseline Model
    10.1. A Variety of Missing Observations
    10.2. Identification without a Nonclassical Proxy Variable
    10.3. Models with Higher-Order Lags
    10.4. Models with Time-Specific Effects
    10.5. Censoring by Contemporaneous $D_t$ instead of Lagged $D_t$

Chapter 2. Structural Partial Effects
  1. Introduction
  2. The Local Instrumental Variable Estimator
  3. Marginal Treatment Effects
  4. Two Special Cases of the MTE
    4.1. Case 1: When First Stage Is a Mean Regression
    4.2. Case 2: When First Stage Is a Quantile Regression
  5. Nonlinear Heterogeneous Effects of Smoking
  6. Summary
  7. Mathematical Appendix
    7.1. Proof of Theorem 2
    7.2. Proof of Theorem 3
    7.3. Proof of Theorem 4
    7.4. Proof of Proposition 3

Chapter 3. Nonparametric Model Tests with Discrete Instruments
  1. Introduction
  2. Bounds as Means of Specification Testing
    2.1. Bounds of the APE and Their Implications for the Hypotheses
    2.2. Numerical Illustrations
  3. The Test Statistics
    3.1. Test of the Hypothesis H0S
    3.2. Test of the Hypothesis H0A
    3.3. Monte Carlo Evidences
  4. Testing with Empirical Data
    4.1. Returns to Schooling
    4.2. Smoking and Infant Birth Weights
  5. Conclusion
  6. Appendix: Well-Defined Conditional Expectations
  7. Appendix: Covariance Matrices $\Gamma$ and $\tilde{\Gamma}$
  8. Appendix: Auxiliary Lemmas
    8.1. Lemma 18
    8.2. Lemma 19
    8.3. Lemma 20
    8.4. Lemma 21
    8.5. Lemma 22
    8.6. Lemma 23
    8.7. Lemma 24
  9. Appendix: Proofs of the Theorem and the Propositions
    9.1. Proof of Theorem 5
    9.2. Proof of Proposition 4
    9.3. Proof of Proposition 5

Bibliography

List of Tables

1.1 MC-simulated distributions of parameter estimates.
1.2 Model estimates with height as a proxy.
2.1 Descriptive statistics of the data.
2.2 Summary of identified structural parameters and the respective first-stage models.
3.1 Results of specification tests for Angrist & Krueger (1991) data.
3.2 Results of specification tests for smoking and infant birth weights.

List of Figures

1.1 A sketch of the proof of the identification strategy.
1.2 Causal diagram for six periods.
1.3 Causal relationships among early human capital, socioeconomic status, and mortality in adulthood.
1.4 Markov probabilities of employment in the next two years.
1.5 Conditional survival probabilities in the next two years.
1.6 Markov probabilities of employment in the next two years among the subpopulation of individuals who reported health problems that limit work in 1971.
1.7 Conditional survival probabilities in the next two years among the subpopulation of individuals who reported health problems that limit work in 1971.
1.8 Markov probabilities of employment in the next two years among the subpopulation of individuals who eventually died from acute diseases according to death certificates.
1.9 Conditional survival probabilities in the next two years among the subpopulation of individuals who eventually died from acute diseases according to death certificates.
1.10 Counterfactual simulations.
2.1 The role of monotonicity in related identification results.
2.2 Confidence intervals of $E[\beta(X, S, A) \mid Y = 2500, X = x, Z = 0.30, S = \bar{s}]$.
2.3 Confidence intervals of $E[\beta(X, S, A) \mid Y = 3000, X = x, Z = 0.30, S = \bar{s}]$.
2.4 Confidence intervals of $E[\beta(X, S, A) \mid Y = 2500, X = x, Z = 0.40, S = \bar{s}]$.
2.5 Confidence intervals of $E[\beta(X, S, A) \mid Y = 3000, X = x, Z = 0.40, S = \bar{s}]$.
2.6 Confidence intervals of $E[\beta(X, S, A) \mid Y = 2500, X = x, Z = 0.50, S = \bar{s}]$.
2.7 Confidence intervals of $E[\beta(X, S, A) \mid Y = 3000, X = x, Z = 0.50, S = \bar{s}]$.
2.8 Confidence intervals of $E[\beta(X, S, A) \mid X = x, Z = 0.30, S = \bar{s}]$.
2.9 Confidence intervals of $E[\beta(X, S, A) \mid X = x, Z = 0.40, S = \bar{s}]$.
2.10 Confidence intervals of $E[\beta(X, S, A) \mid X = x, Z = 0.50, S = \bar{s}]$.
3.1 Relationships between the true $APE(10, z)$ and their bounds.
3.2 Graphical characterizations of specification tests.
3.3 Simulated power curves of the specification tests.
3.4 Heterogeneous first-stage effects across years of schooling and quarters of birth. Sample: place of birth from Arkansas, Kentucky, or Tennessee for all birth years.
3.5 Bounds and confidence regions of $APE(9.5, z)$ for $z = 2, 3, 4$. Sample: place of birth in Arkansas, Kentucky, or Tennessee.
3.6 Heterogeneous first-stage effects across number of cigarettes smoked and cigarette tax rate.
3.7 Bounds and confidence regions of $APE(5, z)$ for $z = 1, 2, 3$ with $h = 3$.

Abstract of “Essays in Econometrics of Heterogeneous Agents” by Yuya Sasaki, Ph.D., Brown University, May 2012

Economic models often involve non-separability between observed and unobserved heterogeneous characteristics of economic agents. This dissertation presents methods of identification, estimation, and inference of nonparametric and nonseparable economic models for cross section and panel data.
The first chapter discusses identification and estimation of nonseparable dynamic panel data with non-random dynamic selection. It shows that nonseparable dynamic panel models with endogenous attrition can be identified from six time periods of unbalanced panel data. The principle of constrained maximum likelihood is proposed for consistent estimation. The second chapter discusses identification of average structural partial effects for endogenous nonseparable cross-section models without assuming monotonicity. Nonparametric identification methods are proposed for various first-stage structural and reduced-form assumptions. The third chapter discusses statistical methods of model tests for endogenous nonseparable cross-section models when instruments exhibit discrete variations and the outcome structure is not monotone with respect to unobserved heterogeneity. It shows that the testing method possesses sufficient power even if instruments are discrete and exert only local effects on endogenous choice.

CHAPTER 1

Heterogeneity and Selection in Dynamic Panel Data

1. Introduction

Dynamics, nonseparable heterogeneity, and selection have been separately treated in the panel data literature, in spite of their joint relevance to a wide array of applications. First, common economic variables of interest are modeled to follow dynamics, e.g., assets, income, physical capital and human capital. Second, many economic models entail nonseparable heterogeneity, i.e., an additively separable residual does not summarize abilities, preferences and technologies. Third, most empirical panel data are unbalanced by (self-) selection. Indeed, consideration of these three issues – dynamics, nonseparable heterogeneity, and selection – is essential, but existing econometric methods do not handle them at the same time.
To fill this gap, this paper proposes a set of conditions for identification of dynamic panel data models in the presence of both nonseparable heterogeneity and dynamic selection.¹ Nonparametric point identification is achieved by using information involving either a proxy variable or a slightly longer panel. Specifically, the model is point-identified using $T = 3$ periods of unbalanced panel data and a proxy variable. A special case of this identification result occurs by constructing the proxy variable from three additional periods, i.e., $T = 6$ in total.

¹ In the introductory section of his monograph, Hsiao (2003) particularly picks up heterogeneity and selection as the two major sources of bias in panel data analysis, which motivates the goal of this paper.

For example, consider the dynamics of socio-economic status (SES) and its causal effects on adult mortality.² Many unobserved individual characteristics such as genetics, patience and innate abilities presumably affect both SES and survival in non-additive ways, which would incur a bias unless explicitly accounted for. Furthermore, a death outcome of the survival selection induces subsequently missing observations, which may result in a selection bias, often called survivorship bias. The following dynamic panel model with selection accommodates this example:

\[
\left\{
\begin{array}{lll}
Y_t = g(Y_{t-1}, U, E_t) & t = 2, \cdots, T & \text{(Dynamic Panel Model)} \\
D_t = h(Y_t, U, V_t) & t = 1, \cdots, T - 1 & \text{(Selection Model)} \\
F_{Y_1 U} & & \text{(Initial Condition)}
\end{array}
\right.
\]

The first equation models the dynamics of the observed state variable $Y_t$, such as SES, as a first-order Markov process with unobserved heterogeneity $U$. The second equation models a binary choice of the selection variable, $D_t$, such as survival, as a Markov decision process with unobserved heterogeneity $U$.
The initial condition $F_{Y_1 U}$ models the dependence of the initial state $Y_1$ on unobserved heterogeneity $U$.³ The period-specific shocks $(E_t, V_t)$ are exogenous, leaving the fixed effect $U$ as the only source of endogeneity. The economic agent drops out of the panel upon $D_t = 0$, as in the case of death. Consequently, data are observed in the following manner: $(Y_2, D_2)$ is observed if $D_1 = 1$; $(Y_3, D_3)$ is observed if $D_1 = D_2 = 1$; and so on. Heckman and Navarro (2007) introduced this formulation of dynamic selection.

² Among many biological and socioeconomic factors of mortality (Cutler, Deaton, and Lleras-Muney, 2006), the role of SES and economic environments has been investigated in a number of empirical studies (e.g., Ruhm, 2000; Deaton and Paxson, 2001; Snyder and Evans, 2006; Sullivan and von Wachter, 2009a,b).

³ The distribution $F_{Y_1 U}$ features the initial conditions problem for dynamic panel data models. See Wooldridge (2005) and Honoré and Tamer (2006) for discussions on the initial conditions problem in the contexts of nonlinear and binary outcome models. Blundell and Bond (1998) and Hahn (1999) use semiparametric distributions to obtain identifying restrictions and efficiency gain. In applications, the initial condition $F_{Y_1 U}$ together with the function $g$ are important to disentangle spurious state dependence of a long-run outcome (Heckman, 1981a,b).

How can we nonparametrically point identify the nonseparable functions $(g, h)$ and the initial condition $F_{Y_1 U}$ under this setup of endogenously unbalanced panel data? Common ways to handle selection include matching and weighting. These approaches, however, presume selection on observables, parametric models, and additively separable models, none of which is assumed in this paper. Even without selection, the standard panel data techniques such as first differencing, demeaning, projection, and moment restrictions do not generally work for nonseparable and nonparametric models.
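The observation scheme of the dynamic panel model with selection can be made concrete with a small simulation. The sketch below is only illustrative: the binary supports, the particular forms chosen for $g$, $h$, and the initial condition, and every numeric constant are assumptions introduced here, not specifications taken from this paper. It generates $Y_t$ and $D_t$ period by period and records them only while the unit remains in the panel.

```python
import numpy as np

def simulate_panel(n, T, seed=0):
    """Simulate an endogenously unbalanced panel (hypothetical g, h, F_{Y1 U})."""
    rng = np.random.default_rng(seed)
    U = rng.integers(0, 2, size=n)            # binary heterogeneity (assumption)
    Y = np.full((n, T), np.nan)               # NaN marks unobserved states
    D = np.full((n, T - 1), np.nan)           # D_t is modeled for t = 1, ..., T-1
    # Initial condition: Y_1 depends on U (illustrative probabilities).
    y = (rng.random(n) < 0.3 + 0.4 * U).astype(float)
    in_panel = np.ones(n, dtype=bool)
    for t in range(T):
        if t > 0:
            # Y_t = g(Y_{t-1}, U, E_t): U interacts with the lagged state,
            # so the heterogeneity is not additively separable.
            E = rng.random(n)
            y = (E < 0.2 + 0.5 * y + 0.2 * U * (1 - y)).astype(float)
        Y[in_panel, t] = y[in_panel]          # observed only before attrition
        if t < T - 1:
            # D_t = h(Y_t, U, V_t): staying in the panel depends on Y_t and U.
            V = rng.random(n)
            d = V < 0.7 + 0.1 * y + 0.15 * U
            D[in_panel, t] = d[in_panel]
            in_panel &= d                     # units with D_t = 0 drop out
    return Y, D, U

Y, D, U = simulate_panel(n=10_000, T=3)
print("share observed by period:", np.mean(~np.isnan(Y), axis=0))
```

Because both the transition and the attrition probabilities depend on $U$, the surviving sample in later periods is selected on the unobserved heterogeneity, which is the survivorship problem described above.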
The literature on nonseparable cross section models proposes constructing auxiliary variables, such as a proxy variable or a control variable, to remove endogeneity (e.g., Garen, 1984; Imbens and Newey, 2009).⁴ Likewise, Altonji and Matzkin (2005) show that a control variable can also be constructed from panel data for sibling and neighborhood panels. This paper complements Altonji and Matzkin along two dimensions. First, we show that a proxy variable can be constructed from dynamic panel data, similar to their construction of a control variable from sibling and neighborhood panel data. Second, the proxy variable, akin to a control variable,⁵ handles not only nonseparable heterogeneity, but also dynamic selection. We propose a method of using the proxy variable to nonparametrically difference out both nonseparable heterogeneity and dynamic selection at the same time.

⁴ Chesher (2003) can also be viewed as a control variable method, cf. Imbens and Newey (2009; Theorem 2).

⁵ Proxy and control variables are similar in that both of them are correlated with unobserved factors. But they differ in terms of independence conditions: if $X$ denotes an endogenous regressor and $U$ denotes unobserved factors, then a proxy variable $Z$ and a control variable $Z'$ satisfy $Z \perp\!\!\!\perp X \mid U$ and $U \perp\!\!\!\perp X \mid Z'$, respectively.

The nonparametric differencing relies on a nonclassical proxy variable, which we define as a noisy signal of true unobserved heterogeneity with a nonseparable noise. This definition is reminiscent of nonclassical measurement errors (ME).⁶ A natural approach to identification, therefore, is to adapt the methods used in the nonclassical ME literature to the current context.

⁶ See Lewbel (2006), Mahajan (2006), Schennach (2007), Hu (2008), Hu and Schennach (2008), and Schennach, Song, and White (2011) for the literature on the nonclassical ME.
This paper follows the spectral decomposition approach (e.g., Hu, 2008; Hu and Schennach, 2008) to nonparametrically identify mixture components.⁷

⁷ See Henry, Kitamura, and Salanié (2010) for general identification results for mixture models.

The identification procedure is outlined as follows. First, the method of nonparametric differencing removes the influence of nonseparable heterogeneity and selection. After removing these two sources of bias, the spectral decomposition identifies the mixture component $f_{Y_t \mid Y_{t-1} U}$, which in turn represents the observational equivalence class of the true nonseparable function $g$ by normalizing the distribution of the exogenous error $E_t$, following Matzkin (2003, 2007). This sequence yields nonparametric point identification of $g$ from short unbalanced panel data. The selection function $h$ can be similarly identified by a few additional steps of the spectral decomposition and solving integral equations⁸ to identify representing mixture components.

⁸ Precisely, they are the Fredholm equations of the first kind. See Carrasco, Florens, and Renault (2007).

2. Background

Selection is of natural interest in panel data analysis because attrition is an issue in most, if not all, panel data sets. While many applications focus on the dynamic model $g$ as the object of primary interest, the selection function $h$ also helps to explain important causal effects in a variety of economic problems. In the SES and mortality example, identification of the survival selection function $h$ allows us to learn about the causal effects of SES on mortality. Generally, the selection function $h$ can be used to model hazards of panel attrition. Examples include (i) school dropout (Cameron and Heckman, 1998; Eckstein and Wolpin, 1999; Belzil and Hansen, 2002; Heckman and Navarro, 2007); (ii) retirement from a job (Stock and Wise, 1990; Rust and Phelan, 1997; Karlstrom, Palme, and Svensson, 2004; French, 2005; Aguirregabiria, 2010; French and Jones, 2011); (iii) replacement of depreciated capital (Rust, 1987) and replacement of managers (Brown, Goetzmann, Ibbotson, and Ross, 1992); (iv) sterilization (Hotz and Miller, 1993); (v) exit from markets (Aguirregabiria and Mira, 2007; Pakes, Ostrovsky, and Berry, 2007); (vi) recovery from a disease (Crawford and Shum, 2005); and (vii) death (Contoyannis, Jones, and Rice, 2004; Halliday, 2008). Examples (i)–(v) are particularly motivated by rational hazards formulated in the following structural framework.

Example 1 (Optimal Stopping as a Rational Choice of Hazard). Suppose that an economic agent knows her current utility or profit as a function $\pi$ of state $y_t$ and heterogeneity $u$. Let $v_t^d$ denote a selection-specific private shock for each choice $d \in \{0, 1\}$, which is known to the agent. She also knows her exit value as a function $\nu$ of state $y_t$ and heterogeneity $u$. Using the dynamic function $g$, define the value function $\nu$ as the fixed point of the Bellman equation

\[
\nu(y_t, u) = E\left[\max\left\{\pi(y_t, u) + V_t^1 + \beta E[\nu(g(y_t, u, E_{t+1}), u)],\ \pi(y_t, u) + V_t^0 + \beta \nu(y_t, u)\right\}\right],
\]

where $\beta$ denotes the rate of time preference. The reduced-form self-selection function $h$ is then defined by

\[
h(y_t, u, v_t) := \mathbb{1}\Big\{ \underbrace{\beta E[\nu(g(y_t, u, E_{t+1}), u)]}_{\text{Continuation value}} - \underbrace{\beta \nu(y_t, u)}_{\text{Exit value}} > \underbrace{v_t^0 - v_t^1}_{= \, v_t} \Big\}.
\]

The agent decides to exit at time $t$ if $h(Y_t, U, V_t) = 0$. Identification of the reduced form $h$ is important in many applications.⁹ Moreover, the reduced form $h$ also reveals the heterogeneous conditional choice probability (CCP), $f_{D_t \mid Y_t U}$, which in turn can be used to recover heterogeneous structural primitives by using the method of Hotz and Miller (1993).¹⁰ □

⁹ Counterfactual policy analysis is often possible with the reduced-form selection function as a sufficient statistic; see Marschak's (1953) maxim discussed by Heckman (2000) and Heckman and Vytlacil (2007).
¹⁰ I keep identification of the primitives out of the scope of this paper. Primitives are known to be generally under-identified without additional restrictions (Rust, 1994; Magnac and Thesmar, 2002; Pesendorfer and Schmidt-Dengler, 2008). These features may be more generally treated in the literature of set identification and set inference, e.g., Bajari, Benkard, and Levin (2007) and the follow-up literature. Identification of CCP under finite heterogeneous types has been discussed by Magnac and Thesmar (2002) and Kasahara and Shimotsu (2009). Aguirregabiria and Mira (2007) considered market-level unobserved heterogeneity as a variant of their main model. While we focus on identification of heterogeneous CCP, Arcidiacono and Miller (2011) suggested a method of estimating heterogeneous CCP.

As this example suggests, nonparametric identification of the heterogeneous CCP follows as a byproduct of our identification results,¹¹ showing a connection between this paper and the literature on structural dynamic discrete choice models. When attrition, $D_t = 0$, is associated with hazards or ends of some duration, our identification results also entail nonparametric identification of the mixed hazard model and the distribution of unobserved heterogeneity.¹² In this sense, our objective is also related to the literature on duration analysis (e.g., Lancaster, 1979; Elbers and Ridder, 1982; Heckman and Singer, 1984; Honoré, 1990; Ridder, 1990; Horowitz, 1999; Ridder and Woutersen, 2003).

¹¹ Taking the expectation of $h(y, u, \cdot\,)$ with respect to the distribution of the exogenous error $V_t$ yields the heterogeneous CCP, $f_{D_t \mid Y_t U}(1 \mid y, u)$ for each $(y, u)$. The heterogeneous CCP is also identified by Kasahara and Shimotsu (2009), which this paper complements by introducing missing observations in data.

¹² The nonparametric mixed hazard model and the marginal distribution $F_U$ of unobserved heterogeneity follow from the identified survival selection function $h$ and the initial condition $F_{Y_1 U}$, respectively.

The paper covers three econometric topics, (A) panel data, (B) selection/missing data, and (C) nonseparable models. To show the place of this paper, I briefly discuss these related branches of the literature. Because the field is extensive, the following list is not exhaustive.

(A) and (B): panel data with selection has been discussed from the perspective of (i) a selection model (Hausman and Wise, 1979; Das, 2004), (ii) variance adjustment (Baltagi, 1985; Baltagi and Chang, 1994), (iii) additional data such as refreshment samples (Ridder, 1992; Hirano, Imbens, Ridder, and Rubin, 2001; Bhattacharya, 2008), (iv) matching (Kyriazidou, 1997), (v) weighting (Hellerstein and Imbens, 1999; Moffitt, Fitzgerald, and Gottschalk, 1999; Wooldridge, 2002), and (vi) partial identification (Khan, Ponomareva, and Tamer, 2011). We contribute to this literature by allowing nonseparability in addition to selection/missing data.

(A) and (C): nonseparable panel models have been treated with (i) random coefficients and interactive fixed effects (Hsiao, 1975; Pesaran and Smith, 1995; Hsiao and Pesaran, 2004; Graham and Powell, 2008; Arellano and Bonhomme, 2009; Bai, 2009), (ii) bias reduction (discussed in the extensive body of literature surveyed by Arellano and Hahn, 2005), (iii) identification of local partial effects (Altonji and Matzkin, 2005; Altonji, Ichimura, and Otsu, 2011; Graham and Powell, 2008; Arellano and Bonhomme, 2009; Bester and Hansen, 2009; Chernozhukov, Fernández-Val, Hahn, and Newey, 2009; Hoderlein and White, 2009), (iv) partial identification (Honoré and Tamer, 2006; Chernozhukov, Fernández-Val, Hahn, and Newey, 2010), (v) partial separability (Evdokimov, 2009), and (vi) assumptions of surjective and/or injective operators (Kasahara and Shimotsu, 2009; Bonhomme, 2010; Hu and Shum, 2010; Shiu and Hu, 2011).
The paper contributes to this literature by introducing selection/missing data in addition to allowing nonseparability.

Identification of a nonseparable dynamic panel data model is studied by Shiu and Hu (2011), who use independently evolving covariates as auxiliary variables, similar to one of the two identification results of this paper using a proxy as an auxiliary variable. This paper complements Shiu and Hu along two dimensions. First, our identification result using $T = 6$ periods eliminates the need to assume the independently evolving covariates or any other auxiliary variable. Second, we can allow for selection/missing data in addition to dynamics and nonseparability.

3. An Overview

We start out with an informal overview of the identification strategy in this section, followed by formal identification results summarized in Section 4.

Briefly described, the nonparametric differencing method works in the following manner. Let $z$ denote a proxy variable. Observed data $A_z$, which are contaminated by mixed heterogeneity and selection, can be decomposed as $A_z = B_z C$, where $B_z$ contains model information and $C$ contains the two sources of bias, i.e., heterogeneity and selection. The contaminant holder, $C$, does not depend on $z$ by an exclusion restriction. Thus, using two values of $z$, say $z = 0, 1$, selectively eliminates $C$ by the operator composition

\[
A_1 A_0^{-1} = B_1 C C^{-1} B_0^{-1} = B_1 B_0^{-1},
\]

without losing the model information $B_z$. This shows how heterogeneity and selection contained in $C$ are nonparametrically differenced out, and is analogous to the familiar first differencing method, which eliminates fixed effects by using two values of $t$ instead of two values of $z$.

Section 3.1 sketches the identification strategy using $T = 3$ periods of panel data and a proxy variable. The intuition is as follows.
First, using variations in $Y_1$ in the equation $y_2 = g(Y_1, U, E_2)$ involving the first two periods, $t = 1, 2$, we can retrieve information about $(U, E_2)$ associated with $Y_2 = y_2$. This is comparable to the first stage in the cross section context except for the endogeneity of the first-stage regressor $Y_1$. The proxy variable, which is correlated with $Y_1$ only through $U$, disentangles $U$ and $Y_1$ to fix the endogeneity. We then use this knowledge about $U$ to identify the heterogeneous dynamic through the equation $Y_3 = g(y_2, U, E_3)$ involving the latter two periods, $t = 2, 3$, which is comparable to the second stage.

Section 3.2 sketches the identification strategy using $T = 6$ periods without a proxy variable. With six periods, the three consecutive observations, $Y_2$, $Y_3$ and $Y_4$, together constitute a substitute for the proxy. Intuitively, controlling for the adjacent states, $Y_2$ and $Y_4$, the intermediate state $Y_3$ is correlated with $(Y_1, Y_5, Y_6)$ only through the heterogeneity $U$. This allows $Y_3$ to serve as a proxy for $U$, conditionally on $Y_2$ and $Y_4$. The constructed proxy identifies both $F_{Y_6 \mid Y_5 U}$, which represents the dynamic function $g$, and the initial condition $F_{Y_1 U}$.

3.1. A Sketch of the Identification Strategy. Consider the model which consists of $(g, h, F_{Y_1 U}, \zeta, F_{E_t}, F_{V_t}, F_W)$ where

\[
\left\{
\begin{array}{lll}
Y_t = g(Y_{t-1}, U, E_t) & t = 2, \cdots, T & \text{(State Dynamics)} \\
D_t = h(Y_t, U, V_t) & t = 1, \cdots, T - 1 & \text{(Selection)} \\
F_{Y_1 U} & & \text{(Initial Condition)} \\
Z = \zeta(U, W) & & \text{(Optional: Nonclassical Proxy)}
\end{array}
\right.
\tag{1}
\]

The observed state variable $Y_t$, such as SES, follows a first-order Markov process $g$ with nonseparable heterogeneity $U$. The selection variable $D_t$, such as survival, follows a Markov decision process $h$ with heterogeneity $U$. The outcome $D_t = 0$, such as death, indicates attrition after which the counterfactual state variable $Y_t$ becomes unobservable. The distribution $F_{Y_1 U}$ of $(Y_1, U)$ models dependence of the initial state $Y_1$ on unobserved heterogeneity $U$.
The last optional equation models the proxy variable $Z$ as a noisy signal of the true unobserved heterogeneity $U$ with a nonseparable noise variable $W$.¹³ This proxy equation is optional when $T \geq 6$, because the three additional periods construct the proxy.

The functional relations in (1) together with the following exogeneity assumptions define the econometric model.

(i) Exogeneity of $E_t$: $E_t \perp\!\!\!\perp (U, Y_1, \{E_s\}_s$ [...]

[...] $0$ for each $u \in \{0, 1\}$

(11) Initial Heterogeneity: $E[U \mid Y_2 = y, Y_1 = 0, D_1 = 1] \neq E[U \mid Y_2 = y, Y_1 = 1, D_1 = 1]$

Restriction (8) requires that the dynamic model $g$ is a nontrivial function of the unobserved heterogeneity, and implies that the matrix $P_y$ is non-singular.¹⁶ Restriction (9) requires that the proxy model $(\zeta, F_W)$ exhibits nondegeneracy, and implies that the matrix $Q_0$ is non-singular. Restriction (10) requires a positive survival probability for each heterogeneous type $u \in \{0, 1\}$, and hence drives no type $U$ into extinction. Restriction (11) requires that the unobserved heterogeneity is present at the initial observation. The last two restrictions (10) and (11) together imply that the nonparametric residual matrix $\tilde{L}_y$ is non-singular.

¹⁶ Under the current simplified setting with Bernoulli random variables, a violation of this assumption implies absence of endogeneity in the dynamic model, and thus the dynamic function $g$ would still be identified. However, the other functions are not guaranteed to be identified without this assumption.

Now that the nonparametric residual $\tilde{L}_y$ containing the two sources of bias has been eliminated, it remains to identify the elements of the matrices $P_y$ and $Q_z$ from equation (7). This can be accomplished by showing the uniqueness of eigenvalues and eigenvectors (e.g., Hu, 2008; Kasahara and Shimotsu, 2009). Because the matrix $Q_z$ is diagonal, (7) forms a diagonalization of the observable matrix $L_{y,1} L_{y,0}^{-1}$. The diagonal elements of $Q_1 Q_0^{-1}$ and the columns of $P_y$ are the eigenvalues and the eigenvectors of $L_{y,1} L_{y,0}^{-1}$, respectively. Therefore, $Q_1 Q_0^{-1}$ is identified by the eigenvalues of the observable matrix $L_{y,1} L_{y,0}^{-1}$ without an additional assumption.

On the other hand, identification of $P_y$ follows from the following additional restriction:

(12) Relevant Proxy: $E[\zeta(0, W)] \neq E[\zeta(1, W)]$.

This restriction (12) requires that the proxy model $\zeta$ is a nontrivial function of the true unobserved heterogeneity on average. It characterizes the relevance of $Z$ as a proxy of $U$, and implies that the elements of $Q_1 Q_0^{-1}$ (i.e., the eigenvalues of $L_{y,1} L_{y,0}^{-1}$) are distinct. The distinct eigenvalues uniquely determine the corresponding eigenvectors up to scale. Since an eigenvector $[f_{Y_{t+1} \mid Y_t U}(0 \mid y, u),\ f_{Y_{t+1} \mid Y_t U}(1 \mid y, u)]'$ is a vector of conditional densities which sum to one, the scale is also uniquely determined. Therefore, $P_y$ and $Q_1 Q_0^{-1}$ are identified by the observed data $L_{y,1} L_{y,0}^{-1}$. The identified eigenvalues¹⁷ take the form of the proxy odds $f_{Z \mid U}(1 \mid u) / (1 - f_{Z \mid U}(1 \mid u))$, which in turn uniquely determine the diagonal elements $f_{Z \mid U}(z \mid u)$ of $Q_z$ for each $z$. This procedure heuristically shows how the elements $(g, \zeta)$ of the model are identified from endogenously unbalanced panel data.

¹⁷ Even though we obtain real eigenvalues in this spectral decomposition, $L_{y,1} L_{y,0}^{-1}$ need not be symmetric. Note that a Hermitian operator is sufficient for a real spectrum, but not necessary. This identification result holds as long as the identifying restrictions are satisfied.

Remark 1. The general identification procedure consists of six steps. The current subsection presents a sketch of the first step to identify $(g, \zeta)$. Five additional steps show that the remaining two elements $(h, F_{Y_1 U})$ of the model are also identified. Figure 1.1 summarizes all six steps. Section 4 presents a complete identification result.

[Figure 1.1. A sketch of the proof of the identification strategy. The diagram traces Steps 1 to 6 from the observable distributions $F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1)$, $F_{Y_2 Y_1 D_1}(\cdot, \cdot, 1)$, and $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$ to the model elements $g$, $h$, $\zeta$, and $F_{Y_1 U}$. Legend: (AD) Adjoint Operator; (OI) Operator Inversion; (QR) Quantile Representation; (SD) Spectral Decomposition.]

Discussion 1. This sketch of the identification strategy demonstrates how the proxy handles both selection and nonseparable heterogeneity at the same time. The trick of Equation (5) or (6) is to isolate the selection ($D_1 = D_2 = 1$) and the nonparametric distribution of the nonseparable heterogeneity $U$ into the nonparametric residual matrix $\tilde{L}_y$, which in turn is eliminated in Equation (7). Our method thus can be considered as a nonparametric differencing facilitated by a proxy variable, nonparametrically differencing out both nonseparable fixed effects and endogenous selection. This process is analogous to the first differencing method which differences out fixed effects arithmetically. Our nonparametric differencing occurs in the non-commutative group of matrices (generally the group of linear operators), whereas the first differencing occurs in $(\mathbb{R}, +)$.
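The differencing and eigendecomposition steps sketched in this subsection can be checked numerically in the Bernoulli case. The snippet below is a minimal sketch under stated assumptions: the entries of `P_y`, `f_z1`, and `L_tilde` are hypothetical values chosen only so that the matrices are non-singular and the proxy odds are distinct, and the eigenvalues are matched to the types $u = 0, 1$ by assuming that $f_{Z \mid U}(1 \mid u)$ is increasing in $u$. It builds $L_{y,z} = P_y Q_z \tilde{L}_y$ for $z = 0, 1$, forms $L_{y,1} L_{y,0}^{-1}$, and recovers $P_y$ and $f_{Z \mid U}$ without using $\tilde{L}_y$.

```python
import numpy as np

# Hypothetical primitives (all numbers are illustrative assumptions).
P_y = np.array([[0.8, 0.3],
                [0.2, 0.7]])          # column u: f_{Y_{t+1}|Y_t U}(. | y, u)
f_z1 = np.array([0.25, 0.60])         # f_{Z|U}(1 | u) for u = 0, 1
Q = {0: np.diag(1.0 - f_z1), 1: np.diag(f_z1)}
L_tilde = np.array([[0.10, 0.25],
                    [0.30, 0.15]])    # stands in for the residual matrix

# "Observed" matrices L_{y,z} = P_y Q_z L_tilde for z = 0, 1.
L = {z: P_y @ Q[z] @ L_tilde for z in (0, 1)}

# Nonparametric differencing: L_tilde cancels in L_{y,1} L_{y,0}^{-1}.
A = L[1] @ np.linalg.inv(L[0])

# Spectral decomposition: eigenvalues are the proxy odds,
# eigenvectors are the columns of P_y up to scale.
eigvals, eigvecs = np.linalg.eig(A)
order = np.argsort(eigvals.real)      # labeling assumption: odds increase in u
odds = eigvals.real[order]
P_hat = eigvecs.real[:, order]
P_hat = P_hat / P_hat.sum(axis=0)     # columns are densities, so they sum to one
f_z1_hat = odds / (1.0 + odds)        # invert odds = f / (1 - f)

print(np.round(P_hat, 6))
print(np.round(f_z1_hat, 6))
```

The normalization step uses only the fact that each column of $P_y$ sums to one, which is what pins down the scale of the eigenvectors in the argument above.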
In the non-commutative group, the proxy $Z$ plays the role of selectively canceling out the nonparametric residual matrix $\tilde L_y$ while leaving the $P_y$ and $Q_z$ matrices intact. The use of a proxy parallels the classical idea of using instruments as a means of removing endogeneity (Hausman and Taylor, 1981). Instrumental variables are useful for additive models because projection (moment restriction) of additive models on instruments removes fixed effects, as in Hausman and Taylor. This projection method is not generally feasible for nonseparable and nonparametric models. Therefore, this paper uses a proxy variable, akin to a control variable (see footnote 18), to nonparametrically difference out the nonseparable fixed effect along with selection, as argued above. This point is revisited in Discussion 3 of Section 3.2.

3.2. A Sketch of the Identification Strategy for T = 6. When $T = 6$, we identify the model $(g, h, F_{Y_1 U})$ without using an outside proxy variable or the proxy model $\zeta$. In the presence of an outside proxy, the main identification strategy was to derive the decomposition $L_{y,z} = P_y Q_z \tilde L_y$, from which $\tilde L_y$ was eliminated (cf. Section 3.1). A similar idea applies to the case of $T = 6$ without an outside proxy. Again, assume that $Y_t$, $U$, and $Z$ follow the Bernoulli distribution for ease of exposition. Let $Z := Y_3$ for notational convenience. Using the exogeneity restriction (3) yields the following decomposition of the observed data.
\[
\underbrace{f_{Y_6 Y_5 Y_4 Z Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(y_6, y_5, y_4, z, y_2, y_1, 1, 1, 1, 1, 1)}_{\text{Observed from Data} \;\to\; L_{y_5,y_4,z,y_2}}
= \sum_u \underbrace{f_{Y_6|Y_5 U}(y_6 \mid y_5, u)}_{\text{Model } g \;\to\; P_{y_5}}
\times \underbrace{f_{Y_4 Z D_5 D_4 D_3|Y_2 U}(y_4, z, 1, 1, 1 \mid y_2, u)}_{\text{An Alternative to Proxy Model } \zeta \;\to\; Q_{y_4,z,y_2}}
\cdot \underbrace{f_{Y_5|Y_4 U}(y_5 \mid y_4, u) \cdot f_{Y_2 Y_1 U D_2 D_1}(y_2, y_1, u, 1, 1)}_{\text{To Be Eliminated} \;\to\; \tilde L_{y_5,y_4,y_2}}
\]

where $z = y_3$. This equality can be equivalently written in terms of matrices as $L_{y_5,y_4,z,y_2} = P_{y_5} Q_{y_4,z,y_2} \tilde L_{y_5,y_4,y_2}$ for each $(y_5, y_4, z, y_2)$, where the $2 \times 2$ matrices are defined as

\[
\begin{aligned}
L_{y_5,y_4,z,y_2} &:= \left[ f_{Y_6 Y_5 Y_4 Z Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(i, y_5, y_4, z, y_2, j, 1, 1, 1, 1, 1) \right]_{(i,j) \in \{0,1\} \times \{0,1\}} \\
P_{y_5} &:= \left[ f_{Y_6|Y_5 U}(i \mid y_5, j) \right]_{(i,j) \in \{0,1\} \times \{0,1\}} \\
Q_{y_4,z,y_2} &:= \mathrm{diag}\left( \left( f_{Y_4 Z D_5 D_4 D_3|Y_2 U}(y_4, z, 1, 1, 1 \mid y_2, 0), \; f_{Y_4 Z D_5 D_4 D_3|Y_2 U}(y_4, z, 1, 1, 1 \mid y_2, 1) \right)' \right) \\
\tilde L_{y_5,y_4,y_2} &:= \left[ f_{Y_5|Y_4 U}(y_5 \mid y_4, i) \cdot f_{Y_2 Y_1 U D_2 D_1}(y_2, j, i, 1, 1) \right]_{(i,j) \in \{0,1\} \times \{0,1\}}
\end{aligned}
\]

Similarly to the case with an outside proxy, varying $z = y_3$ while fixing $(y_5, y_4, y_2)$ eliminates $\tilde L_{y_5,y_4,y_2}$ because it does not depend on $z = y_3$. Under rank restrictions, the composition

\[
\underbrace{L_{y_5,y_4,1,y_2} L^{-1}_{y_5,y_4,0,y_2}}_{\text{Observed Data}}
= P_{y_5} Q_{y_4,1,y_2} \tilde L_{y_5,y_4,y_2} \tilde L^{-1}_{y_5,y_4,y_2} Q^{-1}_{y_4,0,y_2} P^{-1}_{y_5}
= P_{y_5} \underbrace{Q_{y_4,1,y_2} Q^{-1}_{y_4,0,y_2}}_{\text{Diagonal}} P^{-1}_{y_5}
\]

yields the eigenvalue-eigenvector decomposition that identifies the dynamic model $g$ represented by the matrix $P_{y_5}$. Five additional steps identify the rest $(h, F_{Y_1 U})$ of the model.

18 If $X$ denotes an endogenous regressor and $U$ denotes unobserved factors, then a proxy, a control variable, and an instrument are characterized by $Z \perp\!\!\!\perp X \mid U$, $X \perp\!\!\!\perp U \mid Z$, and $Z \perp\!\!\!\perp U$, respectively. The conditional independence (4) thus characterizes $Z$ as a proxy rather than a control variable or an instrument.

Discussion 2. Why do we need $T = 6$? For convenience of illustration, ignore selection. The arrows in the diagram below indicate the directions of the causal effects.
We are interested in $g$ and $F_{Y_1 U}$, which are enclosed by round shapes. First, note that a variation in $Y_6$ in response to a variation in $(Y_5, U)$ reveals $g$. Second, a variation in $U$ in response to a variation in $Y_1$ reveals $F_{U|Y_1}$, hence $F_{Y_1 U}$. We can see from the causal diagram that $Y_2$, $Y_3$, and $Y_4$ are correlated with $U$, and hence we may conjecture that they could serve as a proxy of $U$. However, none of them, say $Z := Y_3$, can be a genuine proxy by itself, because the redundant proxy assumption similar to (4), say $Z \perp\!\!\!\perp Y_5 \mid U$, would be violated with this choice $Z = Y_3$. That is, even if we control for $U$, $Y_3$ is still correlated with $Y_5$ through the dynamic channel along the horizontal arrows. In order to shut out this dynamic channel, we control for the intermediate state $Y_4$ between $Y_3$ and $Y_5$. Using the language of the causal inference literature, we say that $(U, Y_4)$ "d-separates" $Y_3$ and $Y_5$ in the causal diagram below, and this d-separation implies the conditional independence restriction $Y_3 \perp\!\!\!\perp Y_5 \mid (U, Y_4)$; see Pearl (2000). Therefore $Y_3$ is now a genuine proxy of $U$ conditionally on the fixed $Y_4$ for the purpose of analyzing the dynamic model $g$. Similarly, we control for the intermediate state $Y_2$ between $Y_1$ and $Y_3$ to analyze the initial condition $F_{Y_1 U}$. The causal diagram indicates that $(U, Y_2)$ "d-separates" $Y_1$ and $Y_3$, hence $Y_3 \perp\!\!\!\perp Y_1 \mid (U, Y_2)$. This makes $Y_3$ a genuine proxy of $U$ conditionally on the fixed $Y_2$ for the purpose of analyzing the initial condition $F_{Y_1 U}$. Controlling for the two adjacent states, $Y_2$ and $Y_4$, costs the three consecutive periods $(Y_2, Y_3, Y_4)$ for the constructed proxy model $Q_{y_4,z,y_2}$. This is the intuition behind the requirement of three additional periods for identification without an outside proxy variable. See Section 10.2 for a formal proof.
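The two conditional independence claims in Discussion 2 can be checked numerically in the binary case. The following is a minimal sketch, not part of the original analysis: it assumes a first-order binary Markov chain with a fixed effect and no selection, and every numerical value (`p_u`, `p_y1`, `g`) is a hypothetical choice made only for illustration. It builds the exact joint distribution of $(U, Y_1, \ldots, Y_5)$ by enumeration and measures the largest departure from conditional independence.

```python
import itertools
import numpy as np

# Hypothetical primitives (illustrative values, not taken from the text).
p_u = np.array([0.4, 0.6])      # Pr(U = u)
p_y1 = np.array([0.3, 0.7])     # Pr(Y_1 = 1 | U = u)
g = np.array([[0.2, 0.5],       # g[y_prev, u] = Pr(Y_t = 1 | Y_{t-1} = y_prev, U = u)
              [0.6, 0.9]])

def bern(p, y):
    """Probability that a Bernoulli(p) variable equals y."""
    return p if y == 1 else 1.0 - p

# Exact joint pmf, indexed as joint[u, y1, y2, y3, y4, y5].
joint = np.zeros((2,) * 6)
for u, *ys in itertools.product([0, 1], repeat=6):
    prob = p_u[u] * bern(p_y1[u], ys[0])
    for prev, cur in zip(ys[:-1], ys[1:]):
        prob *= bern(g[prev, u], cur)
    joint[(u, *ys)] = prob

AXES = "uabcde"  # einsum labels for (U, Y1, Y2, Y3, Y4, Y5)

def max_ci_violation(pmf, a, b, cond):
    """Largest |P(a, b | c) - P(a | c) P(b | c)| over all cells.

    a and b are single axis labels; cond is a string of axis labels.
    """
    p_abc = np.einsum(f"{AXES}->{a + b + cond}", pmf)   # marginal P(a, b, c)
    p_c = p_abc.sum(axis=(0, 1), keepdims=True)         # P(c)
    p_a_c = p_abc.sum(axis=1, keepdims=True) / p_c      # P(a | c)
    p_b_c = p_abc.sum(axis=0, keepdims=True) / p_c      # P(b | c)
    return np.abs(p_abc / p_c - p_a_c * p_b_c).max()

print("Y3 vs Y5 given (U, Y4):", max_ci_violation(joint, "c", "e", "ud"))
print("Y3 vs Y1 given (U, Y2):", max_ci_violation(joint, "c", "a", "ub"))
print("Y3 vs Y5 given U only :", max_ci_violation(joint, "c", "e", "u"))
```

Under these assumed values, the first two quantities vanish up to floating-point error, while conditioning on $U$ alone leaves a visible dependence between $Y_3$ and $Y_5$, which is the dynamic channel that conditioning on $Y_4$ shuts out.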
[Figure 1.2. Causal diagram for six periods. The unobserved heterogeneity $U$ affects every outcome, and the outcomes form the chain $Y_1 \to Y_2 \to Y_3 \to Y_4 \to Y_5 \to Y_6$ through $g$. $Y_3$ (labeled $Z$) serves as a proxy for $U$ conditionally on $(Y_2, Y_4)$, which corresponds to $Q_{y_4,z,y_2}$; the dynamic model $f_{Y_6|Y_5 U}$ corresponds to $P_{y_5}$.]

Discussion 3. The idea of using three additional time periods to construct a proxy variable parallels the well-known idea of using lags as instruments to form identifying restrictions for additively separable dynamic panel data models (e.g., Anderson and Hsiao, 1982; Arellano and Bond, 1991; Ahn and Schmidt, 1995; Arellano and Bover, 1995; Blundell and Bond, 1998; Hahn, 1999). Because projection or moment restriction on instruments is not generally a viable option for nonseparable and nonparametric models, the literature on nonseparable cross section models proposes constructing a proxy variable or a control variable from instruments (Garen, 1984; Florens, Heckman, Meghir, and Vytlacil, 2008; Imbens and Newey, 2009). Altonji and Matzkin (2005) show that a control variable can also be constructed from panel data for sibling and neighborhood panels. This paper proposes constructing a proxy variable from three additional observations of dynamic panel data, similar to Altonji and Matzkin's construction of a control variable from sibling and neighborhood panels. The constructed proxy variable turns out to account for not only nonseparable heterogeneity but also selection, as argued above.

4. Identification

This section formalizes the identification result, a part of which is sketched in Section 3.

4.1. Identifying Restrictions.
Identification is proved by showing that the inverse DGP correspondence

\[
(F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1), \; F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1)) \mapsto (g, h, F_{Y_1 U}, \zeta, F_{E_t}, F_{V_t}, F_W)
\]

is well defined, up to observational equivalence classes represented by a certain normalization of the error distributions $F_{E_t}$, $F_{V_t}$, and $F_W$ à la Matzkin (2003). To this end, we invoke the following four restrictions on the set of potential data-generating models.

Restriction 1 (Representation). Each of the functions $g$, $h$, and $\zeta$ is non-decreasing and càglàd (left-continuous with right limits) in the last argument. The distributions of $E_t$, $V_t$, and $W$ are absolutely continuous with convex supports, and each of $\{E_t\}_t$ and $\{V_t\}_t$ is identically distributed across $t$.

The weak (as opposed to strict) monotonicity of the functions with respect to idiosyncratic errors accommodates discrete outcomes $Y_t$, $D_t$, and $Z$ under absolutely continuous distributions of errors $(E_t, V_t, W)$. The purpose of Restriction 1 is to construct representations of the equivalence classes of nonseparable functions up to which $g$, $h$, and $\zeta$ are uniquely determined by the distributions $F_{Y_t|Y_{t-1}U}$, $F_{D_t|Y_t U}$, and $F_{Z|U}$, respectively. The independence restriction stated below, in addition to Restriction 1, allows for their quantile representations in particular.

Restriction 2 (Independence). (i) Exogeneity of $E_t$: $E_t \perp\!\!\!\perp (U, Y_1, \{E_s\}_s$ [...] $\delta > 0$ such that $0 < f_{Z|U}(1 \mid u) \le 1 - \delta$ for all $u$. Relevant Proxy: $f_{Z|U}(1 \mid u) \neq f_{Z|U}(1 \mid u')$ whenever $u \neq u'$. (iii) No Extinction: $f_{D_2|Y_2 U}(1 \mid y, u) > 0$ for all $u \in \mathcal{U}$. (iv) Initial Heterogeneity: the integral operator $S_y : L^2(F_{Y_t}) \to L^2(F_U)$ defined by

\[
S_y \xi(u) = \int f_{Y_2 Y_1 U D_1}(y, y', u, 1) \cdot \xi(y') \, dy'
\]

is bounded and invertible.

Under the special case discussed in Section 3.1, Restriction 3 is equivalent to (8)–(12), by which the dynamic function $g$ and the proxy model $\zeta$ were identified in that section.
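For the binary special case, the spectral-decomposition step sketched in Section 3.1 can be illustrated in a few lines of linear algebra. The following is a minimal sketch under stated assumptions: the matrices `P`, `L_tilde` and the proxy scores `fz1` are hypothetical values chosen for illustration, and the observable matrices are constructed from them through $L_{y,z} = P_y Q_z \tilde L_y$. The eigenvalues of $L_{y,1} L_{y,0}^{-1}$ then return the proxy odds, and the eigenvectors, rescaled to sum to one, return the columns of $P_y$.

```python
import numpy as np

# Hypothetical "true" objects for a fixed y (illustrative values only).
P = np.array([[0.7, 0.2],          # P[i, u] = Pr(Y_{t+1} = i | Y_t = y, U = u)
              [0.3, 0.8]])         # columns sum to one
fz1 = np.array([0.3, 0.6])         # Pr(Z = 1 | U = u), distinct across u
L_tilde = np.array([[0.15, 0.10],  # residual matrix; any invertible 2 x 2 matrix
                    [0.05, 0.20]])

Q1 = np.diag(fz1)                  # Q_z for z = 1
Q0 = np.diag(1.0 - fz1)            # Q_z for z = 0

# Matrices that would be observable from data: L_{y,z} = P_y Q_z L~_y.
L1 = P @ Q1 @ L_tilde
L0 = P @ Q0 @ L_tilde

# L1 L0^{-1} = P (Q1 Q0^{-1}) P^{-1}: an eigendecomposition with real eigenvalues.
eigvals, eigvecs = np.linalg.eig(L1 @ np.linalg.inv(L0))

# Label the types by increasing proxy score, in the spirit of Restriction 4.
order = np.argsort(eigvals.real)
odds = eigvals.real[order]
P_hat = eigvecs.real[:, order]
P_hat = P_hat / P_hat.sum(axis=0, keepdims=True)  # fix scale: columns sum to one

fz1_hat = odds / (1.0 + odds)      # invert the odds f / (1 - f)

print("recovered P_y:\n", P_hat)
print("recovered Pr(Z = 1 | U = u):", fz1_hat)
```

The residual matrix never needs to be known: it cancels in the product $L_{y,1} L_{y,0}^{-1}$, which is the sense in which the proxy differences it out. If the two proxy scores were equal, the eigenvalues would coincide and the eigenvectors would no longer be determined, which is what the relevant proxy condition rules out.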
The notion of "invertibility" depends on the normed linear spaces on which the operators are defined. We use $L^2$ in order to exploit convenient properties of Hilbert spaces (see footnote 19). A bounded linear operator between Hilbert spaces guarantees existence and uniqueness of its adjoint operator, which of course presumes a pre-Hilbert space structure in particular. Moreover, the invertibility guarantees that the adjoint operator is also invertible, which is an important property used to derive identification of the selection rule $h$ and the initial condition $F_{Y_1 U}$. Andrews (2011) shows that a wide variety of injective operators between $L^2$ spaces can be constructed from an orthonormal basis, and that the completeness assumption "generically" holds. The first part of the rank condition (ii) requires that the proxy model $(\zeta, F_W)$ exhibits nondegeneracy. The second part of the rank condition (ii) requires that $Z$ is a relevant proxy for unobserved heterogeneity $U$, as characterized by distinct proxy scores $f_{Z|U}(1 \mid u)$ across $u$. The rank condition (iii) requires that some survivors remain in each heterogeneous type, so that no type $U$ goes extinct. This restriction is natural because one cannot learn about the dynamic structure of a group of individuals that goes extinct after the first time period.

19 Carrasco, Florens, and Renault (2007) review some important properties of operators on Hilbert spaces.

Restriction 4 (Labeling of U in Nonseparable Models). $u \equiv f_{Z|U}(1 \mid u)$ for all $u \in \mathcal{U}$.

Due to its unobservability, $U$ has neither intrinsic values nor units of measurement. This is a reason for potential non-uniqueness of fully nonseparable functions. The purpose of Restriction 4 is to attach concrete values to unobserved heterogeneity $U$; see also Hu and Schennach (2008). Restriction 4 is innocuous in nonseparable models in the sense that identification is considered up to observational equivalence $g(y, u, \varepsilon) \equiv g_\pi(y, \pi(u), \varepsilon)$ for any permutation $\pi$ of $\mathcal{U}$.
On the other hand, this restriction is redundant and too stringent for additively separable models, in which $U$ has the same unit of measurement as $Y$ by construction. In the latter case, we can replace Restriction 4 by the following alternative labeling assumption.

Restriction 4′ (Labeling of U in Separable Models). $u \equiv \mathrm{E}[g(y, u, E)] - \tilde g(y)$ for all $u \in \mathcal{U}$ and $y \in \mathcal{Y}$ for some function $\tilde g$.

This alternative labeling restriction is innocuous for separable models in the sense that it is automatically satisfied by additive models of the form $g(y, u, \varepsilon) = \tilde g(y) + u + \varepsilon$ with $\mathrm{E}[E] = 0$.

4.2. Representation. Nonparametric identification of nonseparable functions is generally feasible only up to some equivalence classes (e.g., Matzkin, 2003, 2007). Representations of these equivalence classes are discussed as a preliminary step toward identification. Restrictions 1 and 2 allow representations of the functions $g$, $h$, and $\zeta$ by normalizing the distributions of the independent errors.

Lemma 1 (Quantile Representations of Non-Decreasing Càglàd Functions).
(i) Suppose that Restrictions 1 and 2 (i) hold. Then $F_{Y_3|Y_2 U}$ uniquely determines $g$ up to the observational equivalence classes represented by the normalization $E_t \sim \text{Uniform}(0, 1)$.
(ii) Suppose that Restrictions 1 and 2 (ii) hold. Then $F_{D_2|Y_2 U}$ uniquely determines $h$ up to the observational equivalence classes represented by the normalization $V_t \sim \text{Uniform}(0, 1)$.
(iii) Suppose that Restrictions 1 and 2 (iii) hold. Then $F_{Z|U}$ uniquely determines $\zeta$ up to the observational equivalence classes represented by the normalization $W \sim \text{Uniform}(0, 1)$.

A proof is given in Section 8.1.
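Part (iii) of Lemma 1 can be illustrated for a binary proxy. The following is a minimal sketch under assumed values: the proxy scores in `fz1` are hypothetical, and `zeta` is an illustrative name for the generalized inverse $\inf\{z \mid w \le F_{Z|U}(z \mid u)\}$ of the conditional distribution function. With $W \sim \text{Uniform}(0, 1)$, this function reproduces $F_{Z|U}$, and it is non-decreasing and left-continuous in $w$.

```python
import numpy as np

fz1 = np.array([0.3, 0.6])      # hypothetical Pr(Z = 1 | U = u)
support = np.array([0, 1])      # support of the binary proxy Z

def zeta(u, w):
    """Quantile representation: inf{z in support : w <= F_{Z|U}(z | u)}."""
    cdf = np.array([1.0 - fz1[u], 1.0])           # F_{Z|U}(0 | u), F_{Z|U}(1 | u)
    idx = np.searchsorted(cdf, w, side="left")    # first index with cdf >= w
    return support[idx]

# Simulate Z = zeta(U, W) with W ~ Uniform(0, 1) drawn independently of U.
rng = np.random.default_rng(0)
n = 200_000
U = rng.integers(0, 2, size=n)
W = rng.uniform(size=n)
Z = np.where(U == 1, zeta(1, W), zeta(0, W))

for u in (0, 1):
    print(f"u = {u}: simulated Pr(Z = 1 | U = u) = {Z[U == u].mean():.3f}, "
          f"target = {fz1[u]:.3f}")
```

The jump of `zeta(u, .)` occurs immediately after $w = F_{Z|U}(0 \mid u)$, so the function takes the value 0 at the jump point itself, in line with the càglàd requirement of Restriction 1.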
The representations under these assumptions and normalizations are established by the respective quantile regressions:

\[
\begin{aligned}
g(y, u, \varepsilon) &= F^{-1}_{Y_3|Y_2 U}(\varepsilon \mid y, u) := \inf\{y' \mid \varepsilon \le F_{Y_3|Y_2 U}(y' \mid y, u)\} && \forall (y, u, \varepsilon) \\
h(y, u, v) &= F^{-1}_{D_2|Y_2 U}(v \mid y, u) := \inf\{d \mid v \le F_{D_2|Y_2 U}(d \mid y, u)\} && \forall (y, u, v) \\
\zeta(u, w) &= F^{-1}_{Z|U}(w \mid u) := \inf\{z \mid w \le F_{Z|U}(z \mid u)\} && \forall (u, w)
\end{aligned}
\]

The non-decreasing condition in Restriction 1 is sufficient for almost-everywhere equivalence of the quantile representations. We additionally require the càglàd condition of Restriction 1 for point-wise equivalence of the quantile representations. Given Lemma 1, it remains to show that the observed distributions $F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1)$ and $F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1)$ uniquely determine $(F_{Y_3|Y_2 U}, F_{D_2|Y_2 U}, F_{Z|U})$ as well as $F_{Y_1 U}$.

4.3. The Main Identification Result. Section 4.2 shows that $F_{Y_3|Y_2 U}$, $F_{D_2|Y_2 U}$, and $F_{Z|U}$ uniquely determine $g$, $h$, and $\zeta$, respectively, up to the aforementioned equivalence classes. Therefore, the model $(g, h, F_{Y_1 U}, \zeta)$ can be identified by showing that the inverse DGP correspondence

\[
(F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1), \; F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1)) \mapsto (F_{Y_3|Y_2 U}, F_{D_2|Y_2 U}, F_{Y_1 U}, F_{Z|U})
\]

is well defined.

Lemma 2 (Identification). Under Restrictions 1, 2, 3, and 4, the quadruple $(F_{Y_3|Y_2 U}, F_{D_2|Y_2 U}, F_{Y_1 U}, F_{Z|U})$ is uniquely determined by $F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1)$ and $F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1)$.

Combining Lemmas 1 and 2 yields the following main identification result of this paper.

Theorem 1 (Identification). Under Restrictions 1, 2, 3, and 4, the model $(g, h, F_{Y_1 U}, \zeta)$ is identified by $F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1)$ and $F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1)$ up to the equivalence classes represented by the normalizations $E_t, V_t, W \sim \text{Uniform}(0, 1)$.

A proof of Lemma 2 is given in Section 8.2, and consists of six steps of spectral decompositions, operator inversions (solving Fredholm equations of the first kind), and algebra.
Figure 1.1 illustrates how observed data uniquely determine the model $(g, h, F_{Y_1 U}, \zeta)$ through the six steps. Section 3.1 provided an informal sketch of the proof of the first of these steps.

Remark 2. While the baseline model only induces a permanent dropout through $D_t = 0$, we can also allow for an entry through $D_t = 1$. See Section 10.1. This is useful for modeling entry of firms as well as reentry of female workers into the labor market after child birth. This extension also accommodates general unbalanced panel data with various causes of selection.

Remark 3. Three additional time periods can be used as a substitute for a nonclassical proxy variable. In other words, $T = 6$ periods of panel data alone identify the model, and a proxy variable is optional. See Section 3.2 and Section 10.2.

Remark 4. The baseline model consists of a first-order Markov process. Section 10.3 generalizes the baseline result to allow for higher-order lags in the functions $g$ and $h$. Generally, $\tau + 2$ periods of unbalanced panel data identify the model with a $\tau$-th order Markov process $g$ and a $(\tau - 1)$-st order Markov decision rule $h$. The baseline model is a special case with $\tau = 1$.

Remark 5. The baseline model only allows for individual fixed effects $U$. Suppose that the dynamic model $g_t$ involves time effects, for example, to reflect common macroeconomic shocks to income dynamics. Then the model $(\{g_t\}_{t=2}^{T}, h, F_{Y_1 U}, \zeta)$ can be identified by $T + 1$ periods of unbalanced panel data from $t = 0$ to $t = T > 3$. See Section 10.4.

Remark 6. The rule for missing observations was defined in terms of a lagged selection indicator $D_{t-1}$, which depends on $Y_{t-1}$. Data may instead be selected based on contemporaneous $D_t$, which depends on $Y_t$. See Section 10.5. Note that, in the latter case, we may not observe the $Y_t$ on which the selection is based. These two selection criteria reflect the ex ante versus ex post Roy selection processes by rational agents.

Remark 7.
Restriction 3 (i) implies the cardinality relation $|\mathrm{supp}(U)| \le |\mathrm{supp}(Y_t)|$. This cardinality restriction in particular rules out binary $Y_t$ with continuously distributed $U$. Furthermore, the relevant proxy condition in Restriction 3 (ii) implies $\dim(U) \le 1$.

Remark 8. For the result using a proxy variable, the notation appears to suggest that a proxy is time-invariant. However, a time-variant proxy $Z_t = \zeta(U, W_t)$ may also be used as long as $W_t$ satisfies the same independence restriction as $W$.

5. Estimation

The identification result is derived through six steps of spectral decompositions, operator inversions (solutions to Fredholm equations of the first kind), and algebra, as illustrated in Figure 1.1. A sample-analog or plug-in estimation following all these steps is practically infeasible. The present section therefore discusses how to turn this six-step procedure into a one-step procedure.

5.1. Constrained Maximum Likelihood. After showing nonparametric identification as in Section 4, one can generally proceed with maximum likelihood estimation of parametric or semi-parametric sub-models. In our context, however, the presence of missing observations biases the standard maximum likelihood estimator. In this section, we apply the Kullback-Leibler information inequality to translate our main identification result (Lemma 2) into an identification-preserving criterion function, which is robust against selection or missing data.

Because the model $(g, h, F_{Y_1 U}, \zeta)$ is represented by $(F_{Y_t|Y_{t-1}U}, F_{D_t|Y_t U}, F_{Y_1 U}, F_{Z|U})$ (see Lemmas 1 and 4), we use $\mathcal{F}$ to denote the set of all the admissible model representations:

\[
\mathcal{F} = \{(F_{Y_t|Y_{t-1}U}, F_{D_t|Y_t U}, F_{Y_1 U}, F_{Z|U}) \mid (g, h, F_{Y_1 U}, \zeta) \text{ satisfies Restrictions 1, 2, 3, and 4}\}.
\]

As a consequence of the main identification result, the quadruple for the true model $(F^*_{Y_t|Y_{t-1}U}, F^*_{D_t|Y_t U}, F^*_{Y_1 U}, F^*_{Z|U})$ can be characterized by the following criterion, allowing for a one-step plug-in estimator.
Corollary 1 (Constrained Maximum Likelihood). If the quadruple for the true model $(F^*_{Y_t|Y_{t-1}U}, F^*_{D_t|Y_t U}, F^*_{Y_1 U}, F^*_{Z|U})$ is an element of $\mathcal{F}$, then it is the unique solution to

\[
\begin{aligned}
\max_{(F_{Y_t|Y_{t-1}U}, F_{D_t|Y_t U}, F_{Y_1 U}, F_{Z|U}) \in \mathcal{F}} \;
& c_1 \, \mathrm{E}\left[ \log \int f_{Y_t|Y_{t-1}U}(Y_2 \mid Y_1, u) f_{D_t|Y_t U}(1 \mid Y_1, u) f_{Y_1 U}(Y_1, u) f_{Z|U}(Z \mid u) \, d\mu(u) \;\middle|\; D_1 = 1 \right] \\
+ \; & c_2 \, \mathrm{E}\left[ \log \int f_{Y_t|Y_{t-1}U}(Y_3 \mid Y_2, u) f_{Y_t|Y_{t-1}U}(Y_2 \mid Y_1, u) f_{D_t|Y_t U}(1 \mid Y_2, u) f_{D_t|Y_t U}(1 \mid Y_1, u) f_{Y_1 U}(Y_1, u) f_{Z|U}(Z \mid u) \, d\mu(u) \;\middle|\; D_2 = D_1 = 1 \right]
\end{aligned}
\]

for any $c_1, c_2 > 0$, subject to

\[
\int f_{D_t|Y_t U}(1 \mid y_1, u) f_{Y_1 U}(y_1, u) \, d\mu(y_1, u) = f_{D_1}(1)
\]

and

\[
\int f_{Y_t|Y_{t-1}U}(y_2 \mid y_1, u) f_{D_t|Y_t U}(1 \mid y_2, u) f_{D_t|Y_t U}(1 \mid y_1, u) f_{Y_1 U}(y_1, u) \, d\mu(y_2, y_1, u) = f_{D_2 D_1}(1, 1).
\]

A proof is found in Section 9.1. The sense of uniqueness stated in the corollary is up to the equivalence classes identified by the underlying probability measures. Once a representing model $(F_{Y_t|Y_{t-1}U}, F_{D_t|Y_t U}, F_{Y_1 U}, F_{Z|U})$ is parametrically or semi-/non-parametrically specified, the sample analog of the objective and constraints can be formed from observed data. The first term in the objective can be estimated since $(Y_2, Y_1, Z)$ is observed conditionally on $D_1 = 1$. Similarly, the second term can be estimated since $(Y_3, Y_2, Y_1, Z)$ is observed conditionally on $D_2 = D_1 = 1$. All the components in the two constraints are also computable from observed data since $f_{D_1}(1)$ and $f_{D_2 D_1}(1, 1)$ are observable.

This criterion is related to maximum likelihood. The objective is a positively weighted combination of expected log likelihoods conditional on survival. Using this objective alone would therefore incur a survivorship bias. To adjust for the selection bias, the constraints bind the model to correctly predict the observed selection probabilities. Any pair of positive values may be chosen for $c_1$ and $c_2$. However, there is a certain choice of these coefficients that makes the constrained optimization problem easier, as discussed in the following remark.

Remark 9.
Solutions to constrained optimization problems like the one in Corollary 1 are characterized by saddle points of the Lagrangean functional. Although it appears easier than the original six-step procedure, this saddle-point problem over a function space is still practically challenging. By an appropriate choice of $c_1$ and $c_2$, however, we can turn this saddle-point problem into an unconstrained maximization problem. Let $\lambda_1$ and $\lambda_2$ denote the Lagrange multipliers for the two constraints in the corollary. Under some regularity conditions (Fréchet differentiability of the objective and constraint functionals, differentiability of the solution to the selection probability, and the regularity of the solution for the constraint functionals), the choice of $c_1 = \Pr(D_1 = 1)$ and $c_2 = \Pr(D_2 = D_1 = 1)$ guarantees $\lambda_1^* = \lambda_2^* = 1$ at the optimum (see Section 9.2). With this knowledge of the values of $\lambda_1^*$ and $\lambda_2^*$, the solution to the problem in the corollary can now be characterized by a maximum rather than a saddle point. This fact is useful both for the implementation of numerical solution methods and for the availability of the existing large sample theories of parametric, semiparametric, and nonparametric M-estimators.

Remark 10. When $T = 6$ periods of unbalanced panel data are used instead of a proxy variable, a one-step criterion similar to Corollary 1 can be derived. See Section 10.2.

5.2. An Estimator. The six steps of the identification strategy do not admit a practically feasible plug-in estimator. On the other hand, Corollary 1 and Remark 9 together yield a standard M-estimator by the sample analog. We decompose the model set as $\mathcal{F} = \mathcal{F}_1 \times \mathcal{F}_2 \times \mathcal{F}_3 \times \mathcal{F}_4$, where $\mathcal{F}_1$, $\mathcal{F}_2$, $\mathcal{F}_3$, and $\mathcal{F}_4$ are sets of parametric or semi-/non-parametric models for $f_{Y_t|Y_{t-1}U}$, $f_{D_t|Y_t U}$, $f_{Y_1 U}$, and $f_{Z|U}$, respectively. Accordingly, we denote an element of $\mathcal{F}$ by $f = (f_1, f_2, f_3, f_4)$ for brevity.
With this notation, Corollary 1 and Remark 9 imply that the estimator of the true model $f_0$ can be characterized by a solution $\hat f$ to the maximization problem

\[
\max_{f \in \mathcal{F}_{k(n)}} \frac{1}{n} \sum_{i=1}^{n} l(Y_{i3}, Y_{i2}, Y_{i1}, Z_i, D_{i2}, D_{i1}; f)
\]

for some sieve space $\mathcal{F}_{k(n)} = \mathcal{F}_{1,k_1(n)} \times \mathcal{F}_{2,k_2(n)} \times \mathcal{F}_{3,k_3(n)} \times \mathcal{F}_{4,k_4(n)} \subset \mathcal{F}$, where

\[
\begin{aligned}
l(Y_{i3}, Y_{i2}, Y_{i1}, Z_i, D_{i2}, D_{i1}; f) &:= 1\{D_{i1} = 1\} \cdot l_1(Y_{i2}, Y_{i1}, Z_i; f) + 1\{D_{i2} = D_{i1} = 1\} \cdot l_2(Y_{i3}, Y_{i2}, Y_{i1}, Z_i; f) - l_3(f) - l_4(f), \\
l_1(Y_{i2}, Y_{i1}, Z_i; f) &:= \log \int f_1(Y_{i2} \mid Y_{i1}, u) f_2(1 \mid Y_{i1}, u) f_3(Y_{i1}, u) f_4(Z_i \mid u) \, d\mu(u), \\
l_2(Y_{i3}, Y_{i2}, Y_{i1}, Z_i; f) &:= \log \int f_1(Y_{i3} \mid Y_{i2}, u) f_1(Y_{i2} \mid Y_{i1}, u) f_2(1 \mid Y_{i2}, u) f_2(1 \mid Y_{i1}, u) f_3(Y_{i1}, u) f_4(Z_i \mid u) \, d\mu(u), \\
l_3(f) &:= \int f_2(1 \mid y_1, u) f_3(y_1, u) \, d\mu(y_1, u), \quad \text{and} \\
l_4(f) &:= \int f_1(y_2 \mid y_1, u) f_2(1 \mid y_2, u) f_2(1 \mid y_1, u) f_3(y_1, u) \, d\mu(y_2, y_1, u).
\end{aligned}
\]

Note that $Y_{i3}$ and $Y_{i2}$ may be missing in data, but the interactions with the indicators $1\{D_{i1} = 1\}$ and $1\{D_{i2} = D_{i1} = 1\}$ allow the expression $l(Y_{i3}, Y_{i2}, Y_{i1}, Z_i, D_{i2}, D_{i1}; f)$ to make sense even if they are missing.

Besides the identifying Restrictions 1, 2, 3, and 4 for the model set $\mathcal{F}$, we require additional technical assumptions, stated in the appendix for brevity of exposition, to guarantee a well-behaved estimator in large samples.

Proposition 1 (Consistency). Suppose that $\mathcal{F}$ satisfies Restrictions 1, 2, 3, 4, and the assumptions under Section 9.2. If, in addition, Assumptions 1, 2, and 3 in Section 9.3 restrict the model set $\mathcal{F}$, the choice of the sieves $\{\mathcal{F}_{k(n)}\}_{n=1}^{\infty}$, and the data $F_{Y_3 Y_2 Y_1 Z D_2 D_1}$, then $\|\hat f - f_0\| = o_p(1)$ holds, where the norm $\|\cdot\|$ is defined in Section 9.3.

The estimator can also be adapted to semi-parametric and parametric sub-models, which can be more relevant for empirical analysis. Section 9.4 introduces a semi-parametric estimator and its asymptotic distribution. Parametric models may be estimated with the standard M-estimation theory.

5.3. Monte Carlo Evidence.
This section shows Monte Carlo evidence to evaluate the estimation method proposed in this paper. The endogenously unbalanced panel data of $N = 1{,}000$ and $T = 3$ are generated using the following DGP:

\[
\begin{aligned}
& Y_t = \alpha_1 Y_{t-1} + U + E_t, && E_t \sim \text{Normal}(0, \alpha_2) \\
& D_t = 1\{\beta_0 + \beta_1 Y_t + \beta_2 U + V_t \ge 0\}, && V_t \sim \text{Logistic}(0, 1) \\
& F_{Y_1 U}: \; (Y_1, U) \sim \text{Normal}\left( \begin{pmatrix} \gamma_1 \\ \gamma_2 \end{pmatrix}, \begin{pmatrix} \gamma_3^2 & \gamma_3 \gamma_4 \gamma_5 \\ \gamma_3 \gamma_4 \gamma_5 & \gamma_4^2 \end{pmatrix} \right) \\
& Z = 1\{\delta_0 + \delta_1 U + W \ge 0\}, && W \sim \text{Normal}(0, 1)
\end{aligned}
\]

Monte Carlo simulation results of the constrained maximum likelihood estimation are displayed in the first four rows of Table 1.1. The first row shows simulated distributions of parameter estimates by the fully parametric estimation using the true model. The inter-quartile ranges capture the true parameter value of 0.5 without suffering from attrition bias. The second row shows simulated distributions of parameter estimates by semiparametric estimation, where the distribution of $F_{Y_1 U}$ is assumed to be semiparametric with normality of the conditional distribution $F_{U|Y_1}$. The third

True Parameter Values: $\alpha_1 = \alpha_2 = \beta_0 = \beta_1 = \beta_2 = 0.5$

                                        Dynamic Model        Hazard Model
Estimator                 Percentile    alpha1    alpha2     beta0     beta1     beta2
Parametric CMLE           75%           0.556     0.519      0.673     0.855     1.293
                          50%           0.502     0.502      0.517     0.523     0.569
                          25%           0.454     0.483      0.403     0.169     -0.182
Semi-parametric* CMLE     75%           0.566     0.524      0.758     0.882     1.212
                          50%           0.513     0.508      0.555     0.523     0.475
                          25%           0.465     0.489      0.428     0.153     -0.288
Semi-parametric** CMLE    75%           0.558     --         --        0.589
                          50%           0.459     --         --        0.418     0.500 (Fixed)
                          25%           0.368     --         --        0.204
Semi-parametric*** CMLE   75%           0.686     --         --        1.049
                          50%           0.436     --         --        0.719     0.500 (Fixed)
                          25%           0.271     --         --        0.403
Semi-parametric† 1st Step 75%           0.585     --         --        --        --
                          50%           0.495     --         --        --        --
                          25%           0.385     --         --        --        --
Arellano-Bond             75%           0.464     --         --        --        --
                          50%           0.412     --         --        --        --
                          25%           0.352     --         --        --        --
Fixed-Effect Logit        75%           --        --         --        -0.134    --
                          50%           --        --         --        -0.287    --
                          25%           --        --         --        -0.441    --
Random-Effect Logit       75%           --        --         --        0.793     --
                          50%           --        --         --        0.729     --
                          25%           --        --         --        0.672     --

(Columns report the estimates $\hat\alpha_1$, $\hat\alpha_2$, $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$.)
* The distribution of $F_{Y_1 U}$ is semi-parametric.
** The distributions of $E_t$ and $V_t$ are nonparametric.
*** The distribution of $F_{Y_1 U}$ is semi-parametric, and the distributions of $E_t$ and $V_t$ are nonparametric.
† The distribution of $(E_t, V_t, Y_1, U)$ and the functions $g$ and $h$ are nonparametric.
Table 1.1. MC-simulated distributions of parameter estimates.

row shows simulated distributions of parameter estimates by semiparametric estimation, where the distributions of $E_t$ and $V_t$ are assumed to be unknown. The fourth row shows simulated distributions of parameter estimates by semiparametric estimation combining the above two semiparametric assumptions. While the medians are slightly off the true values, the inter-quartile ranges again capture the true parameter value of 0.5.

If one is interested in only the dynamic model $g$, then the sample-analog estimation of the first step in the six-step identification strategy can be used instead of the constrained maximum likelihood. With the notation from Section 3.1, minimizing $\rho(\hat L_{y,1} \hat L_{y,0}^{-1}, \, P_y(\alpha) Q_1 Q_0^{-1} P_y(\alpha)^{-1})$ for some divergence measure or metric $\rho$ yields the first-step estimation. Using the square-integrated difference for $\rho$, the fifth row of Table 1.1 shows a semiparametric first-step estimation for the dynamic model $g$. The interquartile range of the MC-simulated distribution indeed captures the true value of 0.5, and it is reasonably tight for a semiparametric estimator. But why is the interquartile range of this first-step estimator tighter than those of the CMLE in the third and fourth rows, despite the greater degree of nonparametric specification? Recall that the first-step estimation uses only two semi-/non-parametric functions $(g, \zeta)$ because the other elements have been nonparametrically differenced out. This is to be contrasted with the CMLE, which uses all four semi-/non-parametric functions $(g, h, F_{Y_1 U}, \zeta)$. The first-step estimator therefore uses fewer sieve spaces than the CMLE, and incurs smaller mean square errors in finite samples.
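The data-generating process of this section can be simulated directly. The following is a minimal sketch, not the simulation code behind Table 1.1: the function name `simulate_panel` is illustrative, the values of $\gamma$ and $\delta$ are hypothetical because the text does not report them, and $\alpha_2$ is interpreted as the variance of $E_t$, which is also an assumption. Outcomes are masked in the way the selection rule implies: $Y_2$ is observed only if $D_1 = 1$, and $Y_3$ only if $D_1 = D_2 = 1$.

```python
import numpy as np

def simulate_panel(n=1000, T=3, alpha=(0.5, 0.5), beta=(0.5, 0.5, 0.5),
                   gamma=(0.0, 0.0, 1.0, 1.0, 0.5), delta=(0.0, 1.0), seed=0):
    """Simulate an endogenously unbalanced panel (gamma, delta are hypothetical)."""
    rng = np.random.default_rng(seed)
    a1, a2 = alpha
    b0, b1, b2 = beta
    g1, g2, g3, g4, g5 = gamma
    d0, d1 = delta

    # Initial condition: (Y_1, U) jointly normal.
    mean = [g1, g2]
    cov = [[g3 ** 2, g3 * g4 * g5],
           [g3 * g4 * g5, g4 ** 2]]
    y1_u = rng.multivariate_normal(mean, cov, size=n)
    U = y1_u[:, 1]

    Y = np.empty((n, T))
    D = np.empty((n, T), dtype=int)
    Y[:, 0] = y1_u[:, 0]
    for t in range(T):
        if t > 0:
            # alpha_2 is treated as the variance of E_t (an assumption).
            Y[:, t] = a1 * Y[:, t - 1] + U + rng.normal(0.0, np.sqrt(a2), size=n)
        D[:, t] = b0 + b1 * Y[:, t] + b2 * U + rng.logistic(0.0, 1.0, size=n) >= 0

    Z = (d0 + d1 * U + rng.normal(size=n) >= 0).astype(int)

    # Y_{t+1} is observed only if D_1 = ... = D_t = 1.
    alive = np.cumprod(D, axis=1)
    Y_obs = Y.copy()
    Y_obs[:, 1:][alive[:, :-1] == 0] = np.nan
    return Y_obs, D, Z, U

Y_obs, D, Z, U = simulate_panel()
print("share with Y_2 observed:", np.mean(~np.isnan(Y_obs[:, 1])))
print("share with Y_3 observed:", np.mean(~np.isnan(Y_obs[:, 2])))
print("mean of U among all units vs. survivors of period 1:",
      U.mean(), U[D[:, 0] == 1].mean())
```

Because survival depends on both $Y_t$ and $U$, the survivors are a selected sample with a different distribution of $U$, which is the source of the attrition bias discussed for the estimators that ignore selection.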
If there were no missing observations from attrition, existing methods such as Arellano and Bond (1991) would consistently estimate $\alpha_1$. Similarly, because $V_t$ follows the logistic distribution, the fixed-effect logit method (which is the only $\sqrt{N}$-consistent binary response estimator in particular; Chamberlain, 2010) would consistently estimate $\beta_1$ if the counterfactual binary choice of dynamic selection were observable after attrition. However, missing data from attrition cause these estimators to be biased, as shown in the bottom three rows of Table 1.1. Observe that the fixed-effect logit estimator is not only biased, but even has the sign opposite to the truth. This shows that ignoring selection can lead to a misleading result, even if the true parametric and distributional model is known.

6. Empirical Illustration: SES and Mortality

6.1. Background. A large number of biological and socio-economic elements help to explain mortality (Cutler, Deaton, and Lleras-Muney, 2006). Among others, measures of socioeconomic status (SES), including earnings, employment, and income, are important yet puzzling as presumable economic determinants of mortality. The literature has reached no consensus on the sign of the effects of these measures of SES on mortality. On one hand, higher SES seems to play a malignant role. For example, at the macroeconomic level of observation, recessions reduce mortality (Ruhm, 2000). For another example, higher social security income induces higher mortality (Snyder and Evans, 2006). On the other hand, higher SES has been reported to play a protective role. Deaton and Paxson (2001) and Sullivan and von Wachter (2009a) show that higher income reduces mortality. Job displacement, which results in a substantial drop in income, induces higher and long-lasting mortality (Sullivan and von Wachter, 2009b).
The apparent discrepancy of the signs may be reconciled by the fact that these studies consider different sources of income and different units of observation.

A major concern in empirical analysis is the issue of endogeneity. Design-based empirical analysis often provides a solution. However, while non-labor income may allow exogenous variations to facilitate natural and quasi-experimental studies (e.g., Snyder and Evans, 2006), labor outcomes are often harder to control exogenously. An alternative approach is to explicitly control for the common factors that affect both SES and mortality. Education, in particular, is an important observable common factor; e.g., Lleras-Muney (2005) reports causal effects of education on adult mortality. Controlling for this common factor may completely remove the effect of income on mortality. For example, Adams, Hurd, McFadden, Merrill, and Ribeiro (2003) show that income conditional on education is not correlated with health.

[Figure 1.3. Causal relationships among early human capital, socioeconomic status, and mortality in adulthood. Early human capital $U$ has direct effects on mortality in adulthood $D_t$ and indirect effects via socioeconomic status $Y_t$, which in turn affects $D_t$.]

Education may play a substantial role, but it may not be the only common factor that needs to be controlled for. A wide variety of early human capital (HC) besides that reflected in education is considered to affect SES and/or adult health in the long run. Case, Fertig, and Paxson (2005) report long-lasting direct and indirect effects of childhood health on health and well-being in adulthood. Maccini and Yang (2009) find that the natural environment at birth affects adult health. Almond and Mazumder (2005) and Almond (2006) show evidence that HC acquired in utero affects long-run health.
Early HC could contain a wide variety of categories of HC, such as genetic expression, acquired health, knowledge, and skills, all of which develop in an interactive manner with inter-temporal feedbacks during childhood (e.g., Heckman, 2007; Cunha and Heckman, 2008; Cunha, Heckman, and Schennach, 2010). Failure to control for these components of early HC would result in identification of "spurious dependence" (e.g., Heckman, 1981ab, 1991). Early HC may directly affect adult mortality via the development of chronic conditions in childhood. Early HC may also affect earnings, which may in turn affect adult mortality, as illustrated in Figure 1.3.

Identifying these two competing causal effects of $Y_t$ and $U$ on $D_t$, that is, distinguishing between the two channels in Figure 1.3, requires controlling for the unobserved heterogeneity of early HC. Unlike education, however, most components of early HC are unobservable from the viewpoint of econometricians. Suppose that early HC develops into fixed characteristics by adulthood. How can we control for these heterogeneous characteristics? Because of the strong cross-sectional correlation between SES and these heterogeneous characteristics, variation over time is useful to disentangle their competing effects on mortality, e.g., Deaton and Paxson (2001) and Sullivan and von Wachter (2009a). I extend these ideas by treating early HC as a fixed unobserved heterogeneity to be distinguished from time-varying observed measures of SES. To account for both nonseparable heterogeneity and the survivorship bias, I use the econometric method developed in this paper.

6.2. Empirical Model. Sullivan and von Wachter (2009b) show elaborate evidence on the malignant effects of job displacement on mortality, carefully ruling out the competing hypothesis of selective displacement.
Sullivan and von Wachter (2009a), focusing on income as a measure of SES, find that there are protective effects of higher SES on mortality whereas there is little or no evidence of causal effects of unobserved attributes such as patience on mortality. Using the econometric methods developed in this paper, I attempt to complement these analyses by explicitly modeling unobserved heterogeneity and survival selection.

The following econometric model represents the causal relationship described in Figure 1.3.

\[
\begin{cases}
\text{(i)} & Y_{it} = g(Y_{i,t-1}, U_i, E_{it}) & \text{SES Dynamics} \\
\text{(ii)} & D_{it} = h(Y_{it}, U_i, V_{it}) & \text{Survival Selection} \\
\text{(iii)} & F_{Y_1 U} & \text{Initial Condition} \\
\text{(iv)} & Z_i = \zeta(U_i, W_i) & \text{Nonclassical Proxy}
\end{cases}
\]

where $Y_{it}$, $D_{it}$, and $U_i$ denote SES, survival, and unobserved heterogeneity, respectively. As noted earlier, the heterogeneity $U$ reflects early human capital (HC) acquired prior to the start of the panel data, which plays a role in sustaining the employment dynamics in model (i). This early HC may include acquired and innate abilities, knowledge, skills, patience, diligence, and chronic health conditions, which may affect the survival selection (ii) as well as the income dynamics. The initial condition (iii) models a statistical summary of the initial observation of SES that has developed cumulatively and dependently on the early HC prior to the first observation by the econometrician.

For this empirical application, we consider the model in which all the random variables are binary as in Section 3. Specifically, $Y_{it}$ indicates that individual $i$ is (0) unemployed or (1) employed, $D_{it}$ indicates that individual $i$ is (0) dead or (1) alive, and $U_i$ indicates that individual $i$ belongs to (0) type I or (1) type II. Several proxy variables are used for $Z_i$ as a means of showing the robustness of the empirical results.
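A minimal simulation sketch of the binary version of model (i)–(iv) may help fix ideas. All numerical values below (type share, transition, survival, and proxy probabilities) and the helper name `simulate_panel` are hypothetical choices made for illustration; they are not estimates from this paper. The outcome is recorded as missing once the individual has died in an earlier period, which is the dynamic selection that the model is meant to capture.

```python
import random

# Hypothetical parameters (assumptions for illustration only).
P_TYPE_II = 0.45                     # Pr(U = 1)
P_EMP_INIT = {0: 0.85, 1: 0.95}      # Pr(Y_1 = 1 | U = u): initial condition (iii)
P_EMP = {                            # Pr(Y_t = 1 | Y_{t-1} = y, U = u): dynamics (i)
    (0, 0): 0.07, (1, 0): 0.87,
    (0, 1): 0.00, (1, 1): 0.97,
}
P_SURV = {                           # Pr(D_t = 1 | Y_t = y, U = u): selection (ii)
    (0, 0): 0.90, (1, 0): 1.00,
    (0, 1): 0.88, (1, 1): 1.00,
}
P_PROXY = {0: 0.30, 1: 0.70}         # Pr(Z = 1 | U = u): proxy (iv)


def simulate_panel(n, periods=3, seed=0):
    """Simulate n individuals; Y_t is None after death in an earlier period."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n):
        u = int(rng.random() < P_TYPE_II)
        z = int(rng.random() < P_PROXY[u])
        y = int(rng.random() < P_EMP_INIT[u])
        ys, ds, alive = [], [], True
        for t in range(periods):
            if not alive:
                ys.append(None)      # outcome censored by an earlier death
                ds.append(0)
                continue
            if t > 0:
                y = int(rng.random() < P_EMP[(y, u)])   # first-order dynamics
            d = int(rng.random() < P_SURV[(y, u)])      # survival given current Y_t
            ys.append(y)
            ds.append(d)
            alive = bool(d)
        panel.append({"U": u, "Z": z, "Y": ys, "D": ds})
    return panel


panel = simulate_panel(1000)
```

In this sketch $Y_t$ is observed only when $D_{t-1} = 1$, so any estimator applied to the simulated panel sees a sample selected on lagged survival, while $U$ would be hidden from the econometrician.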
The heterogeneous type $U$ does not yet have any intrinsic meaning at this point, but it turns out empirically to have a consistent meaning in terms of a pattern of employment dynamics, as we will see in Section 6.4.

Besides unobserved heterogeneity, other main sources of endogeneity in the analysis of SES and mortality are cohort effects and age effects. In parametric regression analysis, one can usually control for these effects by inserting additive dummies or polynomials of age. Since additive controls are infeasible for our setup of nonseparable models, we implement the econometric analysis for each bin of age categories in order to mitigate the age and cohort effects.

6.3. Data. The NLS Original Cohorts: Older Men consists of 5,020 individuals aged 46–60 as of April 1, 1966. The subjects were surveyed annually or biennially starting in 1966. Attrition is frequent in this panel data. In order for the selection model to exactly represent the survival selection, we remove those individuals with attrition due to reasons other than death.

It is important to rule out competing hypotheses that obscure the credibility of our empirical results. For example, health as well as wealth is an important factor in the retirement decision (Bound, Stinebrickner, and Waidmann, 2010). It is not unlikely that individuals who have chosen to retire from jobs for health problems subsequently die. If so, we would erroneously attribute death to voluntary retirement. To eliminate this confounding factor, we consider the subsample of individuals who reported that health problems do not limit work in 1971. Furthermore, we also consider the subsample of individuals who died from acute diseases such as heart attacks and strokes, because workers dying unexpectedly from acute diseases are less likely to change labor status before a sudden death than those who die from cancer or diabetes. Death certificates are used to classify causes of death to this end.
Recall that the econometric methods presented in this paper offer two paths of identification. One is to use a panel of $T = 3$ with a nonclassical proxy variable, and the other is to use a panel of $T = 6$ without a proxy. While the survey was conducted at more than six time points, the list of survey years does not exhibit equal time intervals (1966, 67, 68, 69, 71, 73, 75, 76, 78, 80, 81, 83, and 90). No annual or biennial sequence of six consecutive periods can be formed from this irregular list of years. Therefore we choose the method of proxy variables. Because one of the proxy variables is collected only once in 1973, we need to set $t = 1$ or $t = 2$ to year 1973 in order to satisfy the identifying restriction. We thus set $t = 2$ to year 1973 to exploit a larger sample, hence using the three-period data from years 71, 73, and 75 in our analysis. The subjects are aged 51–65 in 1971, but we focus on the younger cohorts not facing the retirement age.

We use height, mother's education, and father's occupation as potential candidates for proxy variables. Height reflects health investments in childhood (Schultz, 2002). Mother's education and father's occupation reflect early endowments and investments in human capital in the form of intergenerational inheritance; e.g., Currie and Moretti (2003) show evidence of intergenerational transmission of human capital. We use these three proxies to show the robustness of our empirical results.

6.4. Empirical Results. Table 1.2 summarizes estimates of the first-order process of employment dynamics and the conditional two-year survival probabilities using height as a proxy variable. The top and bottom panels correspond to younger cohorts (aged 51–54 in 1971) and older cohorts (aged 55–58 in 1971), respectively. The left and right columns correspond to Type I ($U_i = 0$) and Type II ($U_i = 1$), respectively.
These unobserved types exhibit a consistent pattern: off-diagonal elements of the employment Markov matrices for Type I dominate those of Type II. In other words, Type I and Type II can be characterized as movers and stayers, respectively.

Table 1.2. Model estimates with height as a proxy.

Birth year cohorts 1917–1920 (aged 51–54 in 1971), N = 822

                                    Type I (U = 0)              Type II (U = 1)
  Share                             54.2%                       45.8%
  Markov matrix                     Y_{t-1} = 0   Y_{t-1} = 1   Y_{t-1} = 0   Y_{t-1} = 1
    Y_t = 0                         0.930         0.128         1.000         0.025
    Y_t = 1                         0.070         0.872         0.000         0.975
  2-year survival probability
    Y_t = 0                         0.899 (0.044)               0.878 (0.149)
    Y_t = 1                         1.000 (0.000)               0.999 (0.038)
  H0: equal probability, p-value    0.021**                     0.445

Birth year cohorts 1913–1916 (aged 55–58 in 1971), N = 727

                                    Type I (U = 0)              Type II (U = 1)
  Share                             53.9%                       46.1%
  Markov matrix                     Y_{t-1} = 0   Y_{t-1} = 1   Y_{t-1} = 0   Y_{t-1} = 1
    Y_t = 0                         0.954         0.162         1.000         0.066
    Y_t = 1                         0.046         0.838         0.000         0.934
  2-year survival probability
    Y_t = 0                         0.912 (0.036)               0.890 (0.107)
    Y_t = 1                         1.000 (0.000)               0.983 (0.042)
  H0: equal probability, p-value    0.013**                     0.416

[Figure 1.4. Markov probabilities of employment in the next two years. Panels: birth year cohorts 1917–1920 (aged 51–54 in 1971) and birth year cohorts 1913–1916 (aged 55–58 in 1971).]

In view of the survival probabilities in the top panel (younger cohorts), we find that individuals almost surely stay alive as long as they are employed. On the other hand, the two-year survival probabilities drop by about 10 percentage points if individuals are unemployed. While the data indicate statistical significance of this difference only for Type I, the magnitudes of the differences in the point estimates are almost identical between the two types. The same qualitative pattern persists in the older cohorts.

To show the robustness of this baseline result, we repeat this estimation using the other two proxy variables, mother's education and father's occupation.
Figure 1.4 graphs estimates of Markov probabilities of employment. The shades in the bars indicate the different proxy variables used for estimation. We see that these point estimates are robust across the three proxy variables, implying that the choice of a particular proxy does not lead to an irregular result in favor of certain claims. Figure 1.5 graphs estimates of conditional two-year survival probabilities. Again, the point estimates are robust across the three proxy variables.

[Figure 1.5. Conditional survival probabilities in the next two years. Panels: birth year cohorts 1917–1920 (aged 51–54 in 1971) and birth year cohorts 1913–1916 (aged 55–58 in 1971).]

As mentioned earlier, selective or voluntary retirement is a potential source of bias. To rule out this possibility, we consider two subpopulations: (1) those individuals who reported that health problems do not limit their work in 1971; and (2) those individuals who eventually died from acute diseases. Figures 1.6 and 1.7 show estimates for the first subpopulation. Figures 1.8 and 1.9 show estimates for the second subpopulation. Again, robustness across the three proxies persists, and the qualitative pattern remains the same as in the baseline result. The relatively large variations in the estimates for the second subpopulation are attributed to small sample sizes due to the limited availability of death certificates, from which we identify causes of death.

In summary, we obtain the following two robust results. First, accounting for unobserved heterogeneity and survivorship bias as well as voluntary retirements, employment status has protective effects on survival selection. This reinforces the results of Sullivan and von Wachter (2009b). Second, there is no evidence of the effects of unobserved attributes on survival selection, since the conditional survival probabilities are almost the same between type I and type II. This is in accord with the claim
This is in accord with the claim 38 Birth Year Cohorts 1917–1920 (aged 51–54 in 1971) Birth Year Cohorts 1913–1916 (aged 55–58 in 1971) Figure 1.6. Markov probabilities of employment in the next two years among the subpopulation of individuals who reported health problems that limit work in 1971. Birth Year Cohorts 1917–1920 (aged 51–54 in 1971) Birth Year Cohorts 1913–1916 (aged 55–58 in 1971) Figure 1.7. Conditional survival probabilities in the next two years among the subpopulation of individuals who reported health problems that limit work in 1971. 39 Birth Year Cohorts 1917–1920 (aged 51–54 in 1971) Birth Year Cohorts 1913–1916 (aged 55–58 in 1971) Figure 1.8. Markov probabilities of employment in the next two years among the subpopulation of individuals who eventually died from acute diseases according to death certificates. Birth Year Cohorts 1917–1920 (aged 51–54 in 1971) Birth Year Cohorts 1913–1916 (aged 55–58 in 1971) Figure 1.9. Conditional survival probabilities in the next two years among the subpopulation of individuals who eventually died from acute diseases according to death certificates. 40 of Sullivan and von Wachter (2009a), who deduce that lagged SES has little effect on mortality conditionally on the SES of immediate past. Using the estimated Markov model g and the estimated initial condition FY1 U , we can simulate the counterfactual employment rates assuming that all the individuals were to remain alive throughout the entire period. Figure 1.10 shows actual employ- ment rates (black lines) and counterfactual employment rates (grey lines) for each cohort category for each proxy variable. I again remark that the qualitative patterns are the same across three proxy variables for each cohort category. Not shockingly, if it were not for deaths, the counterfactual employment rates would have been even lower than what we observed from actual data. 
In other words, deaths among the working-age population make the actual employment rates look higher.

[Figure 1.10. Counterfactual simulations. Columns: height as a proxy, mother's education as a proxy, father's occupation as a proxy. Rows: cohorts aged 51–54 in 1971 and cohorts aged 55–58 in 1971.]

7. Summary

This paper proposes a set of nonparametric restrictions to point-identify dynamic panel data models by nonparametrically differencing out both nonseparable heterogeneity and selection. Identification requires either $T = 3$ periods of panel data and a proxy variable or $T = 6$ periods of panel data without an outside proxy variable. As a consequence of the identification result, the constrained maximum likelihood criterion follows, which corrects for selection and allows for one-step estimation. Monte Carlo simulations are used to demonstrate the effectiveness of the estimators. In the empirical application, I find protective effects of employment on survival selection, and the result is robust.

8. Appendix: Proofs for Identification

8.1. Lemma 1 (Representation).

Proof. (i) First, we show that there exists a function $\bar g$ such that $(\bar g, \mathrm{Uniform}(0,1))$ is observationally equivalent to $(g, F_E)$ for any $(g, F_E)$ satisfying Restrictions 1 and 2. By the absolute continuity and the convex support in Restriction 1, $F_E$ is invertible. Hence, we can define $h := F_E$, whose inverse $h^{-1}$ exists. Now, define $\bar g$ by $\bar g(y, u, \cdot\,) := g(y, u, \cdot\,) \circ h^{-1}$ for each $(y, u)$. Note that, under Restriction 2, $(\bar g, F_{h(E)})$ is observationally equivalent to $(g, F_E)$ by construction. However, we have $h(E) \sim \mathrm{Uniform}(0,1)$ by the definition of $h$. It follows that $(\bar g, \mathrm{Uniform}(0,1))$ is observationally equivalent to $(g, F_E)$.

In light of the previous paragraph, we can impose the normalization $E_t \sim \mathrm{Uniform}(0,1)$. Let $\Lambda(y, u, \varepsilon)$ denote the set

\[
\Lambda(y, u, \varepsilon) = \{ y' \in g(y, u, (0,1)) \mid \varepsilon \le F_{Y_3|Y_2U}(y' \mid y, u) \},
\]

where $g(y, u, (0,1))$ denotes the set $\{ g(y, u, \varepsilon) \mid \varepsilon \in (0,1) \}$. I claim that $g(y, u, \varepsilon) = \inf \Lambda(y, u, \varepsilon)$.
First, we note that $g(y, u, \varepsilon) \in \Lambda(y, u, \varepsilon)$. To see this, calculate

\[
\begin{aligned}
F_{Y_3|Y_2U}(g(y, u, \varepsilon) \mid y, u)
&= \Pr(g(y, u, E_3) \le g(y, u, \varepsilon) \mid Y_2 = y, U = u) \\
&= \Pr(g(y, u, E_3) \le g(y, u, \varepsilon)) \\
&\ge \Pr(E_3 \le \varepsilon) = \varepsilon,
\end{aligned}
\]

where the first equality follows from $Y_3 = g(y, u, E_3)$ given $(Y_2, U) = (y, u)$, the second equality follows from Restriction 2 (i), the next inequality follows from the non-decrease of $g(y, u, \cdot\,)$ by Restriction 1 together with monotonicity of the probability measure, and the last equality is due to $E_t \sim \mathrm{Uniform}(0,1)$. This shows that $\varepsilon \le F_{Y_3|Y_2U}(g(y, u, \varepsilon) \mid y, u)$, hence $g(y, u, \varepsilon) \in \Lambda(y, u, \varepsilon)$.

Second, I show that $g(y, u, \varepsilon)$ is a lower bound of $\Lambda(y, u, \varepsilon)$. Let $y' \in \Lambda(y, u, \varepsilon)$. Since $g$ is non-decreasing and càglàd (left-continuous) in the third argument by Restriction 1, we can define $\varepsilon' := \max\{ e \in (0,1) \mid g(y, u, e) = y' \}$. But then,

\[
\begin{aligned}
F_{Y_3|Y_2U}(y' \mid y, u)
&= \Pr(g(y, u, E_3) \le y' \mid Y_2 = y, U = u) \\
&= \Pr(g(y, u, E_3) \le y') = \varepsilon',
\end{aligned}
\]

where the first equality follows from $Y_3 = g(y, u, E_3)$ given $(Y_2, U) = (y, u)$, the second equality follows from Restriction 2 (i), and the last equality follows from the definition of $\varepsilon'$ together with the non-decrease of $g(y, u, \cdot\,)$ by Restriction 1 and $E_t \sim \mathrm{Uniform}(0,1)$. Using this result, in turn, yields

\[
g(y, u, \varepsilon) \le g(y, u, F_{Y_3|Y_2U}(y' \mid y, u)) = g(y, u, \varepsilon') = y',
\]

where the first inequality follows from $\varepsilon \le F_{Y_3|Y_2U}(y' \mid y, u)$ by definition of $y'$ as well as the non-decrease of $g(y, u, \cdot\,)$ by Restriction 1, the next equality follows from the previous result $F_{Y_3|Y_2U}(y' \mid y, u) = \varepsilon'$, and the last equality follows from the definition of $\varepsilon'$. Since $y'$ was chosen as an arbitrary element of $\Lambda(y, u, \varepsilon)$, this shows that $g(y, u, \varepsilon)$ is indeed a lower bound of it.

Therefore, $g(y, u, \varepsilon) = \inf\{ y' \in g(y, u, (0,1)) \mid \varepsilon \le F_{Y_3|Y_2U}(y' \mid y, u) \}$, and $g$ is uniquely determined by $F_{Y_3|Y_2U}$.
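The infimum characterization above can be illustrated numerically for a discrete outcome. The sketch below is a generic generalized-inverse routine; the function name `quantile_from_cdf` is hypothetical, and the example CDF uses the Type I estimate $\Pr(Y_t = 0 \mid Y_{t-1} = 1) = 0.128$ from Table 1.2 purely as a convenient number.

```python
def quantile_from_cdf(support, cdf, eps):
    """Return inf{y in support : eps <= F(y)} for a discrete CDF.

    support: outcome values in increasing order.
    cdf:     F(y) evaluated at each support point (non-decreasing, ends at 1).
    eps:     a value in (0, 1).
    """
    for y, F_y in zip(support, cdf):
        if eps <= F_y:
            return y
    raise ValueError("cdf must reach 1 on the support")


# Binary outcome with Pr(Y_3 = 0 | y, u) = 0.128, so F = (0.128, 1.0).
support, cdf = [0, 1], [0.128, 1.0]
g_low = quantile_from_cdf(support, cdf, 0.10)    # eps below the jump
g_edge = quantile_from_cdf(support, cdf, 0.128)  # eps exactly at the jump
g_high = quantile_from_cdf(support, cdf, 0.50)   # eps above the jump
```

The value at the jump point belongs to the lower outcome, which matches the left-continuity (càglàd) convention used in the proof.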
(Moreover, note that $\inf\{ y' \in g(y, u, (0,1)) \mid \varepsilon \le F_{Y_3|Y_2U}(y' \mid y, u) \}$ coincides with the definition of the quantile regression $F^{-1}_{Y_3|Y_2U}(\,\cdot \mid y, u)$, hence $g$ is identified by this quantile regression, i.e., $g(y, u, \varepsilon) = F^{-1}_{Y_3|Y_2U}(\varepsilon \mid y, u)$.)

Part (ii) of the lemma can be proved in exactly the same way as part (i). In particular, $h$ is identified by the quantile regression: $h(y, u, v) = F^{-1}_{D_2|Y_2U}(v \mid y, u)$. Similarly, part (iii) of the lemma can be proved in the same way, and $\zeta$ is identified by the quantile regression: $\zeta(u, w) = F^{-1}_{Z|U}(w \mid u)$. □

8.2. Lemma 2 (Identification).

Proof. The proof proceeds in six steps. The first step shows that the observed joint distributions uniquely determine $F_{Y_3|Y_2U}$ and $F_{Z|U}$ by a spectral decomposition of a composite linear operator. The second step is auxiliary, and shows that $F_{Y_2Y_1UD_1}(\,\cdot\,, \cdot\,, \cdot\,, 1)$ is uniquely determined from the observed joint distributions together with inversion of the operators identified in the first step. The third step again uses spectral decomposition to identify an auxiliary operator with the kernel represented by $F_{Y_1|Y_2UD_2D_1}(\,\cdot \mid \cdot\,, \cdot\,, 1, 1)$. In the fourth step, solving an integral equation with the adjoint of this auxiliary operator in turn yields another auxiliary operator with the multiplier represented by $F_{Y_2UD_2D_1}(\,\cdot\,, \cdot\,, 1, 1)$. The fifth step uses the three operators identified in Steps 2, 3, and 4 to identify an operator with the kernel represented by $F_{D_2|Y_2U}$ by solving a linear inverse problem. The last step uses results from Steps 1, 2, and 5 to show that the initial joint distribution $F_{Y_1U}$ is uniquely determined from the observed joint distributions. These six steps together prove that the observed joint distributions $F_{Y_2Y_1ZD_1}(\,\cdot\,, \cdot\,, \cdot\,, 1)$ and $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ uniquely determine the model $(F_{Y_3|Y_2U}, F_{D_2|Y_2U}, F_{Y_1U}, F_{Z|U})$ as claimed in the lemma.
Given fixed $y$ and $z$, define the operators $L_{y,z} : L^2(F_{Y_t}) \to L^2(F_{Y_t})$, $P_y : L^2(F_U) \to L^2(F_{Y_t})$, $Q_z : L^2(F_U) \to L^2(F_U)$, $R_y : L^2(F_U) \to L^2(F_U)$, $S_y : L^2(F_{Y_t}) \to L^2(F_U)$, $T_y : L^2(F_{Y_t}) \to L^2(F_U)$, and $T'_y : L^2(F_U) \to L^2(F_U)$ by

\[
\begin{aligned}
(L_{y,z}\xi)(y_3) &= \int f_{Y_3Y_2Y_1ZD_2D_1}(y_3, y, y_1, z, 1, 1) \cdot \xi(y_1)\,dy_1, \\
(P_y\xi)(y_3) &= \int f_{Y_3|Y_2U}(y_3 \mid y, u) \cdot \xi(u)\,du, \\
(Q_z\xi)(u) &= f_{Z|U}(z \mid u) \cdot \xi(u), \\
(R_y\xi)(u) &= f_{D_2|Y_2U}(1 \mid y, u) \cdot \xi(u), \\
(S_y\xi)(u) &= \int f_{Y_2Y_1UD_1}(y, y_1, u, 1) \cdot \xi(y_1)\,dy_1, \\
(T_y\xi)(u) &= \int f_{Y_1|Y_2UD_2D_1}(y_1 \mid y, u, 1, 1) \cdot \xi(y_1)\,dy_1, \\
(T'_y\xi)(u) &= f_{Y_2UD_2D_1}(y, u, 1, 1) \cdot \xi(u),
\end{aligned}
\]

respectively. We consider $L^2$ spaces as the normed linear spaces on which these operators are defined, particularly in order to guarantee the existence of the adjoint operator $T_y^*$ to be introduced in Step 4. (Recall that a bounded linear operator between Hilbert spaces admits an adjoint operator.) Identification of an operator leads to that of the associated conditional density (up to null sets), and vice versa. Here, the operators $L_{y,z}$, $P_y$, $S_y$, and $T_y$ are integral operators whereas $Q_z$, $R_y$, and $T'_y$ are multiplication operators. Note that $L_{y,z}$ is identified from the observed joint distribution $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$.

Figure 3.1 illustrates the six steps toward identification of $(F_{Y_3|Y_2U}, F_{D_2|Y_2U}, F_{Y_1U}, F_{Z|U})$ from the observed joint distributions $F_{Y_2Y_1ZD_1}(\,\cdot\,, \cdot\,, \cdot\,, 1)$ and $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$. The four objects $(g, h, F_{Y_1U}, \zeta)$ of interest are enclosed by double lines. The objects that can be observed from data are enclosed by dashed lines. All the other objects are intermediary, and are enclosed by solid lines. Starting out with the observed objects, we show in each step that the intermediary objects are uniquely determined. These uniquely determined intermediary objects in turn show the uniqueness of the four objects $(g, h, F_{Y_1U}, \zeta)$ of interest.
Step 1: Uniqueness of $F_{Y_3|Y_2U}$ and $F_{Z|U}$

The kernel $f_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, y, \cdot\,, z, 1, 1)$ of the integral operator $L_{y,z}$ can be rewritten as

\[
\begin{aligned}
f_{Y_3Y_2Y_1ZD_2D_1}(y_3, y, y_1, z, 1, 1) = \int
& f_{Y_3|Y_2Y_1ZUD_2D_1}(y_3 \mid y, y_1, z, u, 1, 1) \\
& \times f_{Z|Y_2Y_1UD_2D_1}(z \mid y, y_1, u, 1, 1) \\
& \times f_{D_2|Y_2Y_1UD_1}(1 \mid y, y_1, u, 1) \cdot f_{Y_2Y_1UD_1}(y, y_1, u, 1)\,du.
\end{aligned}
\tag{13}
\]

But by Lemma 3 (i), (iv), and (iii), respectively, Restriction 2 implies that

\[
\begin{aligned}
f_{Y_3|Y_2Y_1ZUD_2D_1}(y_3 \mid y, y_1, z, u, 1, 1) &= f_{Y_3|Y_2U}(y_3 \mid y, u), \\
f_{Z|Y_2Y_1UD_2D_1}(z \mid y, y_1, u, 1, 1) &= f_{Z|U}(z \mid u), \\
f_{D_2|Y_2Y_1UD_1}(1 \mid y, y_1, u, 1) &= f_{D_2|Y_2U}(1 \mid y, u).
\end{aligned}
\]

Equation (13) thus can be rewritten as

\[
f_{Y_3Y_2Y_1ZD_2D_1}(y_3, y, y_1, z, 1, 1) = \int f_{Y_3|Y_2U}(y_3 \mid y, u) \cdot f_{Z|U}(z \mid u) \times f_{D_2|Y_2U}(1 \mid y, u) \cdot f_{Y_2Y_1UD_1}(y, y_1, u, 1)\,du.
\]

But this implies that the integral operator $L_{y,z}$ is written as the operator composition

\[
L_{y,z} = P_y Q_z R_y S_y.
\]

Restriction 3 (i), (ii), (iii), and (iv) imply that the operators $P_y$, $Q_z$, $R_y$, and $S_y$ are invertible, respectively. Hence so is $L_{y,z}$. Using the two values $\{0, 1\}$ of $Z$, form the product

\[
L_{y,1} L_{y,0}^{-1} = P_y Q_{1/0} P_y^{-1},
\]

where $Q_{z/z'} := Q_z Q_{z'}^{-1}$ is the multiplication operator with proxy odds defined by

\[
(Q_{1/0}\xi)(u) = \frac{f_{Z|U}(1 \mid u)}{f_{Z|U}(0 \mid u)}\,\xi(u).
\]

By Restriction 3 (ii), the operator $L_{y,1} L_{y,0}^{-1}$ is bounded. The expression $L_{y,1} L_{y,0}^{-1} = P_y Q_{1/0} P_y^{-1}$ thus allows unique eigenvalue-eigenfunction decomposition similarly to that of Hu and Schennach (2008).

The distinct proxy odds as in Restriction 3 (ii) guarantee distinct eigenvalues and single dimensionality of the eigenspace associated with each eigenvalue. Within each of the single-dimensional eigenspaces is a unique eigenfunction pinned down by $L^1$-normalization because of the unity of integrated densities. The eigenvalues $\lambda(u)$ yield the multiplier of the operator $Q_{1/0}$, hence $\lambda(u) = f_{Z|U}(1 \mid u)/f_{Z|U}(0 \mid u)$. This proxy odds in turn identifies $f_{Z|U}(\,\cdot \mid u)$ since $Z$ is binary.
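In the all-binary case the operators above reduce to $2 \times 2$ matrices, and the spectral argument can be checked by hand. The sketch below is an illustration under stated assumptions: the columns of `P` and the diagonal of `R` borrow Type I/Type II numbers from Table 1.2 (conditioning on $Y_2 = 1$), while the proxy probabilities in `Q` and the entries of `S` are hypothetical. The helper names (`matmul`, `inv`, `diag`, `eig2`) are not from the paper, and labeling the types by ascending proxy odds is an assumed ordering convention standing in for Restriction 4.

```python
import math


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inv(a):
    """Inverse of an invertible 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def diag(x0, x1):
    return [[x0, 0.0], [0.0, x1]]


def eig2(a):
    """Eigenpairs of a 2x2 matrix with real distinct eigenvalues, a[0][1] != 0.

    Eigenvalues are returned in ascending order; each eigenvector is
    normalized so that its entries sum to one (the L1-type normalization).
    """
    tr = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    root = math.sqrt(tr * tr / 4.0 - det)
    pairs = []
    for lam in (tr / 2.0 - root, tr / 2.0 + root):
        v = (a[0][1], lam - a[0][0])          # solves (A - lam I) v = 0
        s = v[0] + v[1]
        pairs.append((lam, (v[0] / s, v[1] / s)))
    return pairs


# P[y3][u] = f(Y3 = y3 | Y2 = 1, U = u); columns sum to one.
P = [[0.128, 0.025], [0.872, 0.975]]
# Q[z] = diag of f(Z = z | U = u); hypothetical proxy probabilities.
Q = {0: diag(0.7, 0.3), 1: diag(0.3, 0.7)}
# R = diag of f(D2 = 1 | Y2 = 1, U = u).
R = diag(1.000, 0.999)
# S[u][y1] = f(Y2 = 1, Y1 = y1, U = u, D1 = 1); hypothetical, invertible.
S = [[0.02, 0.40], [0.01, 0.35]]


def L(z):
    """Observable operator L_{y,z} = P Q_z R S for the fixed y."""
    return matmul(matmul(P, Q[z]), matmul(R, S))


A = matmul(L(1), inv(L(0)))                   # equals P Q_{1/0} P^{-1}
eigenpairs = eig2(A)
recovered_pz1 = [lam / (1.0 + lam) for lam, _ in eigenpairs]   # f(Z=1 | U=u)
recovered_P_columns = [vec for _, vec in eigenpairs]           # f(Y3=. | 1, u)
```

Only `L(0)` and `L(1)` enter the decomposition, so the recovered proxy probabilities and transition columns are obtained without using `P`, `Q`, `R`, or `S` directly, which is the content of Step 1 in this finite setting.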
The corresponding normalized eigenfunctions are the kernels of the integral operator $P_y$, hence $f_{Y_3|Y_2U}(\,\cdot \mid y, u)$. Lastly, Restriction 4 facilitates unique ordering of the eigenfunctions $f_{Y_3|Y_2U}(\,\cdot \mid y, u)$ by the distinct concrete values of $u = \lambda(u)$. This is feasible because the eigenvalues $\lambda(u) = f_{Z|U}(1 \mid u)/f_{Z|U}(0 \mid u)$ do not depend on $y$. That is, eigenfunctions $f_{Y_3|Y_2U}(\,\cdot \mid y, u)$ of the operator $L_{y,1} L_{y,0}^{-1}$ across different $y$ can be uniquely ordered in $u$ invariantly in $y$ by the common set of ordered distinct eigenvalues $u = \lambda(u)$.

Therefore, $F_{Y_3|Y_2U}$ and $F_{Z|U}$ are uniquely determined by the observed joint distribution $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$. Equivalently, the operators $P_y$ and $Q_z$ are uniquely determined for each $y$ and $z$, respectively.

Step 2: Uniqueness of $F_{Y_2Y_1UD_1}(\,\cdot\,, \cdot\,, \cdot\,, 1)$

By Lemma 3 (ii), Restriction 2 implies $f_{Y_2|Y_1UD_1}(y' \mid y, u, 1) = f_{Y_2|Y_1U}(y' \mid y, u)$. Using this equality, write the density of the observed joint distribution $F_{Y_2Y_1D_1}(\,\cdot\,, \cdot\,, 1)$ as

\[
\begin{aligned}
f_{Y_2Y_1D_1}(y', y, 1)
&= \int f_{Y_2|Y_1UD_1}(y' \mid y, u, 1)\, f_{Y_1UD_1}(y, u, 1)\,du \\
&= \int f_{Y_2|Y_1U}(y' \mid y, u)\, f_{Y_1UD_1}(y, u, 1)\,du.
\end{aligned}
\tag{14}
\]

By Lemma 4 (i), $F_{Y_3|Y_2U}(y' \mid y, u) = F_{Y_2|Y_1U}(y' \mid y, u)$ for all $y', y, u$. Therefore, we can write the operator $P_y$ as

\[
(P_y\xi)(y') = \int f_{Y_3|Y_2U}(y' \mid y, u) \cdot \xi(u)\,du = \int f_{Y_2|Y_1U}(y' \mid y, u) \cdot \xi(u)\,du.
\]

With this operator notation, it follows from (14) that

\[
f_{Y_2Y_1D_1}(\,\cdot\,, y, 1) = P_y f_{Y_1UD_1}(y, \cdot\,, 1).
\]

By Restriction 3 (i), this operator equation can be solved for $f_{Y_1UD_1}(y, \cdot\,, 1)$ as

\[
f_{Y_1UD_1}(y, \cdot\,, 1) = P_y^{-1} f_{Y_2Y_1D_1}(\,\cdot\,, y, 1).
\tag{15}
\]

Recall that $P_y$ was shown in Step 1 to be uniquely determined by the observed joint distribution $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$. The function $f_{Y_2Y_1D_1}(\,\cdot\,, y, 1)$ is also uniquely determined by the observed joint distribution $F_{Y_2Y_1D_1}(\,\cdot\,, \cdot\,, 1)$ up to null sets.
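With binary variables, the inverse problem in (15) is a $2 \times 2$ linear system. The sketch below treats the transition matrix `P` as already recovered in Step 1 (its columns reuse numbers from Table 1.2), generates a hypothetical observable vector from an assumed latent vector `f_y1u_true`, and then solves for the latent vector by Cramer's rule. The names `solve2`, `f_y1u_true`, `f_y2y1_obs`, and `f_y1u_rec` are illustrative and not from the paper.

```python
def solve2(m, b):
    """Solve the 2x2 linear system m x = b by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if abs(det) < 1e-12:
        raise ValueError("matrix is not invertible")
    x0 = (b[0] * m[1][1] - m[0][1] * b[1]) / det
    x1 = (m[0][0] * b[1] - b[0] * m[1][0]) / det
    return [x0, x1]


# P[y2][u] = f(Y2 = y2 | Y1 = y, U = u), taken as known from Step 1.
P = [[0.128, 0.025], [0.872, 0.975]]

# Hypothetical latent object f(Y1 = y, U = u, D1 = 1) for u = 0, 1.
f_y1u_true = [0.45, 0.40]

# Observable f(Y2 = y2, Y1 = y, D1 = 1) implied by equation (14).
f_y2y1_obs = [sum(P[i][u] * f_y1u_true[u] for u in range(2)) for i in range(2)]

# Equation (15): invert P to recover the latent vector from observables.
f_y1u_rec = solve2(P, f_y2y1_obs)
```

Invertibility of `P` plays the role of Restriction 3 (i) here; if the two columns of `P` coincided, the two types could not be separated and `solve2` would raise an error.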
Therefore, (14) shows that $f_{Y_1UD_1}(\,\cdot\,, \cdot\,, 1)$ is uniquely determined by the observed joint distributions $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ and $F_{Y_2Y_1D_1}(\,\cdot\,, \cdot\,, 1)$.

Using the solution to the above inverse problem, we can write the kernel of the operator $S_y$ as

\[
\begin{aligned}
f_{Y_2Y_1UD_1}(y', y, u, 1)
&= f_{Y_2|Y_1UD_1}(y' \mid y, u, 1) \cdot f_{Y_1UD_1}(y, u, 1) \\
&= f_{Y_2|Y_1U}(y' \mid y, u) \cdot f_{Y_1UD_1}(y, u, 1) \\
&= f_{Y_3|Y_2U}(y' \mid y, u) \cdot f_{Y_1UD_1}(y, u, 1) \\
&= f_{Y_3|Y_2U}(y' \mid y, u) \cdot [P_y^{-1} f_{Y_2Y_1D_1}(\,\cdot\,, y, 1)](u),
\end{aligned}
\]

where the second equality follows from Lemma 3 (ii), the third equality follows from Lemma 4 (i), and the fourth equality follows from (15). Since $f_{Y_3|Y_2U}$ was shown in Step 1 to be uniquely determined by the observed joint distribution $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ and $[P_y^{-1} f_{Y_2Y_1D_1}(\,\cdot\,, y, 1)]$ was shown in the previous paragraph to be uniquely determined for each $y$ by the observed joint distributions $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ and $F_{Y_2Y_1D_1}(\,\cdot\,, \cdot\,, 1)$, it follows that $f_{Y_2Y_1UD_1}(\,\cdot\,, \cdot\,, \cdot\,, 1)$ too is uniquely determined by the observed joint distributions $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ and $F_{Y_2Y_1D_1}(\,\cdot\,, \cdot\,, 1)$. Equivalently, the operator $S_y$ is uniquely determined for each $y$.

Step 3: Uniqueness of $F_{Y_1|Y_2UD_2D_1}(\,\cdot \mid \cdot\,, \cdot\,, 1, 1)$

First, note that the kernel of the composite operator $T'_y T_y$ can be written as

\[
\begin{aligned}
f_{Y_2UD_2D_1}(y, u, 1, 1) \cdot f_{Y_1|Y_2UD_2D_1}(y_1 \mid y, u, 1, 1)
&= f_{Y_2Y_1UD_2D_1}(y, y_1, u, 1, 1) \\
&= f_{D_2|Y_2Y_1UD_1}(1 \mid y, y_1, u, 1) \cdot f_{Y_2Y_1UD_1}(y, y_1, u, 1) \\
&= f_{D_2|Y_2U}(1 \mid y, u) \cdot f_{Y_2Y_1UD_1}(y, y_1, u, 1),
\end{aligned}
\tag{16}
\]

where the last equality is due to Lemma 3 (iii). But the last expression corresponds to the kernel of the composite operator $R_y S_y$, thus showing that $T'_y T_y = R_y S_y$. But then, $L_{y,z} = P_y Q_z R_y S_y = P_y Q_z T'_y T_y$.
Note that the invertibility of $R_y$ and $S_y$ as required by Restriction 3 implies invertibility of $T'_y$ and $T_y$ as well, for otherwise the equivalent composite operator $T'_y T_y = R_y S_y$ would have a nontrivial nullspace.

Using Restriction 3, form the product of operators as in Step 1, but in the opposite order as

\[
L_{y,0}^{-1} L_{y,1} = T_y^{-1} Q_{1/0} T_y.
\]

The disappearance of $T'_y$ is due to commutativity of multiplication operators. By the same logic as in Step 1, this expression together with Restriction 3 (ii) admits unique left eigenvalue-eigenfunction decomposition. Moreover, the point spectrum is exactly the same as the one in Step 1, as is the middle multiplication operator $Q_{1/0}$. This equivalence of the spectrum allows consistent ordering of $U$ with that of Step 1. Left eigenfunctions yield the kernel of $T_y$ pinned down by the normalization of unit integral. This shows that the operator $T_y$ is uniquely determined by the observed joint distribution $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$.

Step 4: Uniqueness of $F_{Y_2UD_2D_1}(\,\cdot\,, \cdot\,, 1, 1)$

Equation (16) implies that

\[
\int f_{Y_1|Y_2UD_2D_1}(y_1 \mid y, u, 1, 1) \cdot f_{Y_2UD_2D_1}(y, u, 1, 1)\,du = f_{Y_2Y_1D_2D_1}(y, y_1, 1, 1),
\]

hence yielding the linear operator equation

\[
T_y^* f_{Y_2UD_2D_1}(y, \cdot\,, 1, 1) = f_{Y_2Y_1D_2D_1}(y, \cdot\,, 1, 1),
\]

where $T_y^*$ denotes the adjoint operator of $T_y$. Since $T_y$ is invertible, so is its adjoint $T_y^*$. But then, the multiplier of the multiplication operator $T'_y$ can be given by the unique solution to the above linear operator equation, i.e.,

\[
f_{Y_2UD_2D_1}(y, \cdot\,, 1, 1) = (T_y^*)^{-1} f_{Y_2Y_1D_2D_1}(y, \cdot\,, 1, 1).
\]

The operator $T_y$, hence $T_y^*$, was shown to be uniquely determined by $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ in Step 3, and $f_{Y_2Y_1D_2D_1}(\,\cdot\,, \cdot\,, 1, 1)$ is also available from observed data. Therefore, the operator $T'_y$ is uniquely determined by $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$.
Step 5: Uniqueness of $F_{D_2|Y_2U}(1 \mid \cdot\,, \cdot\,)$

First, the definition of the operators $R_y$, $S_y$, $T_y$, and $T'_y$ and Lemma 3 (iii) yield the operator equality $R_y S_y = T'_y T_y$, where $T_y$ and $T'_y$ have been shown to be uniquely determined by the observed joint distribution $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ in Steps 3 and 4, respectively. Recall that $S_y$ was also shown in Step 2 to be uniquely determined by the observed joint distributions $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ and $F_{Y_2Y_1D_1}(\,\cdot\,, \cdot\,, 1)$. Restriction 3 (iv) guarantees invertibility of $S_y$. It follows that the operator inversion

\[
R_y = (R_y S_y) S_y^{-1} = (T'_y T_y) S_y^{-1}
\]

yields the operator $R_y$, in turn showing that its multiplier $f_{D_2|Y_2U}(1 \mid y, \cdot\,)$ is uniquely determined for each $y$ by the observed joint distributions $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ and $F_{Y_2Y_1D_1}(\,\cdot\,, \cdot\,, 1)$.

Step 6: Uniqueness of $F_{Y_1U}$

Recall from Step 2 that $f_{Y_2Y_1UD_1}(\,\cdot\,, \cdot\,, \cdot\,, 1)$ is uniquely determined by the observed joint distributions $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ and $F_{Y_2Y_1D_1}(\,\cdot\,, \cdot\,, 1)$. We can write

\[
\begin{aligned}
f_{Y_2Y_1UD_1}(y', y, u, 1)
&= f_{Y_2|Y_1UD_1}(y' \mid y, u, 1)\, f_{D_1|Y_1U}(1 \mid y, u)\, f_{Y_1U}(y, u) \\
&= f_{Y_2|Y_1U}(y' \mid y, u)\, f_{D_1|Y_1U}(1 \mid y, u)\, f_{Y_1U}(y, u) \\
&= f_{Y_3|Y_2U}(y' \mid y, u)\, f_{D_2|Y_2U}(1 \mid y, u)\, f_{Y_1U}(y, u),
\end{aligned}
\]

where the second equality follows from Lemma 3 (ii), and the third equality follows from Lemma 4 (i) and (ii). For a given $(y, u)$, there must exist some $y'$ such that $f_{Y_3|Y_2U}(y' \mid y, u) > 0$ by a property of conditional density functions. Moreover, Restriction 3 (iii) requires that $f_{D_2|Y_2U}(1 \mid y, u) > 0$ for a given $y$ for all $u$.
Therefore, for such a choice of $y'$, we can write

\[
f_{Y_1U}(y, u) = \frac{f_{Y_2Y_1UD_1}(y', y, u, 1)}{f_{Y_3|Y_2U}(y' \mid y, u)\, f_{D_2|Y_2U}(1 \mid y, u)}.
\]

Recall that $f_{Y_3|Y_2U}(\,\cdot \mid \cdot\,, \cdot\,)$ was shown in Step 1 to be uniquely determined by the observed joint distribution $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$, $f_{Y_2Y_1UD_1}(\,\cdot\,, \cdot\,, \cdot\,, 1)$ was shown in Step 2 to be uniquely determined by the observed joint distributions $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ and $F_{Y_2Y_1D_1}(\,\cdot\,, \cdot\,, 1)$, and $f_{D_2|Y_2U}(1 \mid \cdot\,, \cdot\,)$ was shown in Step 5 to be uniquely determined by the observed joint distributions $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ and $F_{Y_2Y_1D_1}(\,\cdot\,, \cdot\,, 1)$. Therefore, it follows that the initial joint density $f_{Y_1U}$ is uniquely determined by the observed joint distributions $F_{Y_3Y_2Y_1ZD_2D_1}(\,\cdot\,, \cdot\,, \cdot\,, \cdot\,, 1, 1)$ and $F_{Y_2Y_1D_1}(\,\cdot\,, \cdot\,, 1)$. □

8.3. Lemma 3 (Independence).

Lemma 3 (Independence). The following implications hold:
(i) Restriction 2 (i) $\Rightarrow$ $E_3 \perp\!\!\!\perp (U, Y_1, E_2, V_1, V_2, W)$ $\Rightarrow$ $Y_3 \perp\!\!\!\perp (Y_1, D_1, D_2, Z) \mid (Y_2, U)$.
(ii) Restriction 2 (i) $\Rightarrow$ $E_2 \perp\!\!\!\perp (U, Y_1, V_1, W)$ $\Rightarrow$ $Y_2 \perp\!\!\!\perp (D_1, Z) \mid (Y_1, U)$.
(iii) Restriction 2 (ii) $\Rightarrow$ $V_2 \perp\!\!\!\perp (U, Y_1, E_2, V_1)$ $\Rightarrow$ $D_2 \perp\!\!\!\perp (Y_1, D_1) \mid (Y_2, U)$.
(iv) Restriction 2 (iii) $\Rightarrow$ $W \perp\!\!\!\perp (Y_1, E_2, V_1, V_2)$ $\Rightarrow$ $Z \perp\!\!\!\perp (Y_2, Y_1, D_2, D_1) \mid U$.

Proof. In order to prove the lemma, we use the following two properties of conditional independence:
CI.1. $A \perp\!\!\!\perp B$ implies $A \perp\!\!\!\perp B \mid \phi(B)$ for any Borel function $\phi$.
CI.2. $A \perp\!\!\!\perp B \mid C$ implies $A \perp\!\!\!\perp \phi(B, C) \mid C$ for any Borel function $\phi$.

(i) First, note that Restriction 2 (i) $E_3 \perp\!\!\!\perp (U, Y_1, E_2, V_1, V_2, W)$ together with the structural definition $Z = \zeta(U, W)$ implies $E_3 \perp\!\!\!\perp (U, Y_1, E_2, V_1, V_2, Z)$. Applying CI.1 to this independence relation $E_3 \perp\!\!\!\perp (U, Y_1, E_2, V_1, V_2, Z)$ yields

\[
E_3 \perp\!\!\!\perp (U, Y_1, E_2, V_1, V_2, Z) \mid (g(Y_1, U, E_2), U).
\]

Since $Y_2 = g(Y_1, U, E_2)$, it can be rewritten as $E_3 \perp\!\!\!\perp (U, Y_1, E_2, V_1, V_2, Z) \mid (Y_2, U)$.
Next, applying CI.2 to this conditional independence yields

\[
E_3 \perp\!\!\!\perp (Y_1, h(Y_1, U, V_1), h(Y_2, U, V_2), Z) \mid (Y_2, U).
\]

Since $D_t = h(Y_t, U, V_t)$ for each $t \in \{1, 2\}$, it can be rewritten as $E_3 \perp\!\!\!\perp (Y_1, D_1, D_2, Z) \mid (Y_2, U)$. Lastly, applying CI.2 again to this conditional independence yields

\[
g(Y_2, U, E_3) \perp\!\!\!\perp (Y_1, D_1, D_2, Z) \mid (Y_2, U).
\]

Since $Y_3 = g(Y_2, U, E_3)$, it can be rewritten as $Y_3 \perp\!\!\!\perp (Y_1, D_1, D_2, Z) \mid (Y_2, U)$.

(ii) Note that Restriction 2 (i) $E_2 \perp\!\!\!\perp (U, Y_1, V_1, W)$ together with the structural definition $Z = \zeta(U, W)$ implies $E_2 \perp\!\!\!\perp (U, Y_1, V_1, Z)$. Applying CI.1 to this independence relation $E_2 \perp\!\!\!\perp (U, Y_1, V_1, Z)$ yields

\[
E_2 \perp\!\!\!\perp (U, Y_1, V_1, Z) \mid (Y_1, U).
\]

Next, applying CI.2 to this conditional independence yields

\[
g(Y_1, U, E_2) \perp\!\!\!\perp (U, Y_1, V_1, Z) \mid (Y_1, U).
\]

Since $Y_2 = g(Y_1, U, E_2)$, it can be rewritten as $Y_2 \perp\!\!\!\perp (U, Y_1, V_1, Z) \mid (Y_1, U)$. Lastly, applying CI.2 again to this conditional independence yields

\[
Y_2 \perp\!\!\!\perp (h(Y_1, U, V_1), Z) \mid (Y_1, U).
\]

Since $D_1 = h(Y_1, U, V_1)$, it can be rewritten as $Y_2 \perp\!\!\!\perp (D_1, Z) \mid (Y_1, U)$.

(iii) Applying CI.1 to Restriction 2 (ii) $V_2 \perp\!\!\!\perp (U, Y_1, E_2, V_1)$ yields

\[
V_2 \perp\!\!\!\perp (U, Y_1, E_2, V_1) \mid (g(Y_1, U, E_2), U).
\]

Since $Y_2 = g(Y_1, U, E_2)$, it can be rewritten as $V_2 \perp\!\!\!\perp (U, Y_1, E_2, V_1) \mid (Y_2, U)$. Next, applying CI.2 to this conditional independence yields

\[
V_2 \perp\!\!\!\perp (Y_1, h(Y_1, U, V_1)) \mid (Y_2, U).
\]

Since $D_1 = h(Y_1, U, V_1)$, it can be rewritten as $V_2 \perp\!\!\!\perp (Y_1, D_1) \mid (Y_2, U)$. Lastly, applying CI.2 to this conditional independence yields

\[
h(Y_2, U, V_2) \perp\!\!\!\perp (Y_1, D_1) \mid (Y_2, U).
\]

Since $D_2 = h(Y_2, U, V_2)$, it can be rewritten as $D_2 \perp\!\!\!\perp (Y_1, D_1) \mid (Y_2, U)$.

(iv) Note that Restriction 2 (iii) $W \perp\!\!\!\perp (Y_1, E_2, V_1, V_2)$ together with the structural definition $Z = \zeta(U, W)$ yields $Z \perp\!\!\!\perp (Y_1, E_2, V_1, V_2) \mid U$. Applying CI.2 to this conditional independence relation $Z \perp\!\!\!\perp (Y_1, E_2, V_1, V_2) \mid U$ yields

\[
Z \perp\!\!\!\perp (Y_1, g(Y_1, U, E_2), h(Y_1, U, V_1), h(g(Y_1, U, E_2), U, V_2)) \mid U.
\]
Since $D_t = h(Y_t, U, V_t)$ for each $t \in \{1, 2\}$ and $Y_2 = g(Y_1, U, E_2)$, this conditional independence can be rewritten as $Z \perp\!\!\!\perp (Y_1, Y_2, D_1, D_2) \mid U$. □

8.4. Lemma 4 (Invariant Transition).

Lemma 4 (Invariant Transition). (i) Under Restrictions 1 and 2 (i), $F_{Y_3 \mid Y_2 U}(y' \mid y, u) = F_{Y_2 \mid Y_1 U}(y' \mid y, u)$ for all $y', y, u$. (ii) Under Restrictions 1 and 2 (ii), $F_{D_2 \mid Y_2 U}(d \mid y, u) = F_{D_1 \mid Y_1 U}(d \mid y, u)$ for all $d, y, u$.

Proof. (i) First, note that Restriction 2 (i), $E_3 \perp\!\!\!\perp (U, Y_1, E_2, V_1, V_2, W)$, implies $E_3 \perp\!\!\!\perp (U, Y_1, E_2)$, which in turn implies $E_3 \perp\!\!\!\perp (g(Y_1, U, E_2), U)$, hence $E_3 \perp\!\!\!\perp (Y_2, U)$. Second, Restriction 2 (i) in particular yields $E_2 \perp\!\!\!\perp (Y_1, U)$. Using these two independence results, we obtain
\begin{align*}
F_{Y_3 \mid Y_2 U}(y' \mid y, u) &= \Pr[g(y, u, E_3) \le y' \mid Y_2 = y, U = u] \\
&= \Pr[g(y, u, E_3) \le y'] \\
&= \Pr[g(y, u, E_2) \le y'] \\
&= \Pr[g(y, u, E_2) \le y' \mid Y_1 = y, U = u] \\
&= F_{Y_2 \mid Y_1 U}(y' \mid y, u)
\end{align*}
for all $y', y, u$, where the second equality follows from $E_3 \perp\!\!\!\perp (Y_2, U)$, the third equality follows from the identical distribution of $E_t$ by Restriction 1, and the fourth equality follows from $E_2 \perp\!\!\!\perp (Y_1, U)$.

(ii) Restriction 2 (ii), $V_2 \perp\!\!\!\perp (U, Y_1, E_2, V_1)$, implies that $V_2 \perp\!\!\!\perp (g(Y_1, U, E_2), U)$, hence $V_2 \perp\!\!\!\perp (Y_2, U)$. Restriction 2 (ii) also implies $V_1 \perp\!\!\!\perp (Y_1, U)$. Using these two independence results, we obtain
\begin{align*}
F_{D_2 \mid Y_2 U}(d \mid y, u) &= \Pr[h(y, u, V_2) \le d \mid Y_2 = y, U = u] \\
&= \Pr[h(y, u, V_2) \le d] \\
&= \Pr[h(y, u, V_1) \le d] \\
&= \Pr[h(y, u, V_1) \le d \mid Y_1 = y, U = u] \\
&= F_{D_1 \mid Y_1 U}(d \mid y, u)
\end{align*}
for all $d, y, u$, where the second equality follows from $V_2 \perp\!\!\!\perp (Y_2, U)$, the third equality follows from the identical distribution of $V_t$ by Restriction 1, and the fourth equality follows from $V_1 \perp\!\!\!\perp (Y_1, U)$. □

9. Appendix: Proofs for Estimation

9.1. Corollary 1 (Constrained Maximum Likelihood).

Proof. Denote the supports of the conditional densities by
\[
I_1 = \{(y_2, y_1, z) \mid f_{Y_2 Y_1 Z \mid D_1}(y_2, y_1, z \mid 1) > 0\}
\quad\text{and}\quad
I_2 = \{(y_3, y_2, y_1, z) \mid f_{Y_3 Y_2 Y_1 Z \mid D_2 D_1}(y_3, y_2, y_1, z \mid 1, 1) > 0\}.
\]
The Kullback-Leibler information inequality requires that
\[
\int_{I_1} \log\left[\frac{f_{Y_2 Y_1 Z \mid D_1}(y_2, y_1, z \mid 1)}{\varphi(y_2, y_1, z)}\right] f_{Y_2 Y_1 Z \mid D_1}(y_2, y_1, z \mid 1)\, d\mu(y_2, y_1, z) \ge 0
\]
and
\[
\int_{I_2} \log\left[\frac{f_{Y_3 Y_2 Y_1 Z \mid D_2 D_1}(y_3, y_2, y_1, z \mid 1, 1)}{\psi(y_3, y_2, y_1, z)}\right] f_{Y_3 Y_2 Y_1 Z \mid D_2 D_1}(y_3, y_2, y_1, z \mid 1, 1)\, d\mu(y_3, y_2, y_1, z) \ge 0
\]
for all non-negative measurable functions $\varphi$ and $\psi$ such that $\int \varphi = \int \psi = 1$. These two inequalities hold with equality if and only if $f_{Y_2 Y_1 Z \mid D_1}(\cdot, \cdot, \cdot \mid 1) = \varphi$ and $f_{Y_3 Y_2 Y_1 Z \mid D_2 D_1}(\cdot, \cdot, \cdot, \cdot \mid 1, 1) = \psi$, respectively. (Equalities and uniqueness are stated up to the equivalence classes identified by the underlying probability measures.) Let the set of such pairs of functions $(\varphi, \psi)$ satisfying the above two Kullback-Leibler inequalities be denoted by
\[
\Lambda = \left\{ (\varphi, \psi) \;\middle|\; \varphi \text{ and } \psi \text{ are non-negative measurable functions with } \int \varphi = \int \psi = 1 \right\}.
\]
With this notation, the maximization problem
\[
\max_{(\varphi, \psi) \in \Lambda}\; c_1 E[\log \varphi(Y_2, Y_1, Z) \mid D_1 = 1] + c_2 E[\log \psi(Y_3, Y_2, Y_1, Z) \mid D_2 = D_1 = 1] \tag{17}
\]
has the unique solution $(\varphi, \psi) = (f_{Y_2 Y_1 Z \mid D_1}(\cdot, \cdot, \cdot \mid 1),\ f_{Y_3 Y_2 Y_1 Z \mid D_2 D_1}(\cdot, \cdot, \cdot, \cdot \mid 1, 1))$.

Now, let $F(\,\cdot\,; M)$ denote a distribution function generated by model $M \in \mathcal{F}$.
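For the true model $M^* := (F^*_{Y_t \mid Y_{t-1} U}, F^*_{D_t \mid Y_t U}, F^*_{Y_1 U}, F^*_{Z \mid U})$, we have
\begin{align*}
F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1) &= F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1; M^*), \\
F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1) &= F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1; M^*).
\end{align*}
Moreover, the identification result of Lemma 2 showed that this true model $M^*$ is the unique element in $\mathcal{F}$ that generates the observed parts of the joint distributions $F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1)$ and $F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1)$, i.e.,
\begin{align*}
F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1) = F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1; M) &\quad\text{if and only if } M = M^*, \\
F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1) = F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1; M) &\quad\text{if and only if } M = M^*.
\end{align*}
But this implies that $M^*$ is the unique model that generates the observable conditional densities $f_{Y_2 Y_1 Z \mid D_1}(\cdot, \cdot, \cdot \mid 1)$ and $f_{Y_3 Y_2 Y_1 Z \mid D_2 D_1}(\cdot, \cdot, \cdot, \cdot \mid 1, 1)$ among those models $M \in \mathcal{F}$ that are compatible with the observed selection frequencies $f_{D_1}(1)$ and $f_{D_2 D_1}(1, 1)$, i.e.,
\[
f_{Y_2 Y_1 Z \mid D_1}(\cdot, \cdot, \cdot \mid 1) = f_{Y_2 Y_1 Z \mid D_1}(\cdot, \cdot, \cdot \mid 1; M) \ \text{ if and only if } M = M^*, \ \text{ given } f_{D_1}(1; M) = f_{D_1}(1), \tag{18}
\]
and
\[
f_{Y_3 Y_2 Y_1 Z \mid D_2 D_1}(\cdot, \cdot, \cdot, \cdot \mid 1, 1) = f_{Y_3 Y_2 Y_1 Z \mid D_2 D_1}(\cdot, \cdot, \cdot, \cdot \mid 1, 1; M) \ \text{ if and only if } M = M^*, \ \text{ given } f_{D_2 D_1}(1, 1; M) = f_{D_2 D_1}(1, 1). \tag{19}
\]
Since $(\varphi, \psi) = (f_{Y_2 Y_1 Z \mid D_1}(\cdot, \cdot, \cdot \mid 1),\ f_{Y_3 Y_2 Y_1 Z \mid D_2 D_1}(\cdot, \cdot, \cdot, \cdot \mid 1, 1))$ is the unique solution to (17), the statements (18) and (19) imply that the true model $M^*$ is the unique solution to
\begin{align*}
\max_{M \in \mathcal{F}}\ & c_1 E\left[\log f_{Y_2 Y_1 Z \mid D_1}(Y_2, Y_1, Z \mid 1; M) \mid D_1 = 1\right] \\
& + c_2 E\left[\log f_{Y_3 Y_2 Y_1 Z \mid D_2 D_1}(Y_3, Y_2, Y_1, Z \mid 1, 1; M) \mid D_2 = D_1 = 1\right] \\
\text{s.t. } & f_{D_1}(1; M) = f_{D_1}(1) \ \text{ and } \ f_{D_2 D_1}(1, 1; M) = f_{D_2 D_1}(1, 1),
\end{align*}
or equivalently
\begin{align*}
\max_{M \in \mathcal{F}}\ & c_1 E\left[\log f_{Y_2 Y_1 Z D_1}(Y_2, Y_1, Z, 1; M) \mid D_1 = 1\right] \\
& + c_2 E\left[\log f_{Y_3 Y_2 Y_1 Z D_2 D_1}(Y_3, Y_2, Y_1, Z, 1, 1; M) \mid D_2 = D_1 = 1\right] \tag{20} \\
\text{s.t. } & f_{D_1}(1; M) = f_{D_1}(1) \ \text{ and } \ f_{D_2 D_1}(1, 1; M) = f_{D_2 D_1}(1, 1),
\end{align*}
since the terms that distinguish the two objectives are constants under the constraints.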
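The equivalence of the two constrained problems can be spelled out in one step. The sketch below uses only the factorization of a joint density into a conditional density times a selection probability; the shorthand $p_1 = f_{D_1}(1)$ and $p_2 = f_{D_2 D_1}(1, 1)$ is borrowed from Remark 9 and is used here purely for illustration.

```latex
% Factorization of the joint densities evaluated at D_1 = 1 and D_2 = D_1 = 1:
\begin{align*}
\log f_{Y_2 Y_1 Z D_1}(y_2, y_1, z, 1; M)
  &= \log f_{Y_2 Y_1 Z \mid D_1}(y_2, y_1, z \mid 1; M) + \log f_{D_1}(1; M), \\
\log f_{Y_3 Y_2 Y_1 Z D_2 D_1}(y_3, y_2, y_1, z, 1, 1; M)
  &= \log f_{Y_3 Y_2 Y_1 Z \mid D_2 D_1}(y_3, y_2, y_1, z \mid 1, 1; M) + \log f_{D_2 D_1}(1, 1; M).
\end{align*}
% On the constraint set, f_{D_1}(1; M) = p_1 and f_{D_2 D_1}(1, 1; M) = p_2 for every
% feasible M, so the two objectives differ by a constant that does not depend on M:
\[
\text{objective in (20)} \;=\; \text{objective with conditional densities} \;+\; c_1 \log p_1 + c_2 \log p_2 .
\]
```

Because the added constant is the same for every feasible $M$, the two problems share the same maximizer.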
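By using Lemmas 3 and 4, we can write the equalities
\begin{align*}
f_{Y_2 Y_1 Z D_1}(y_2, y_1, z, 1; M) &= \int f_{Y_t \mid Y_{t-1} U}(y_2 \mid y_1, u)\, f_{D_t \mid Y_t U}(1 \mid y_1, u)\, f_{Y_1 U}(y_1, u)\, f_{Z \mid U}(z \mid u)\, d\mu(u), \\
f_{D_1}(1; M) &= \int f_{D_t \mid Y_t U}(1 \mid y_1, u)\, f_{Y_1 U}(y_1, u)\, d\mu(y_1, u), \\
f_{Y_3 Y_2 Y_1 Z D_2 D_1}(y_3, y_2, y_1, z, 1, 1; M) &= \int f_{Y_t \mid Y_{t-1} U}(y_3 \mid y_2, u)\, f_{Y_t \mid Y_{t-1} U}(y_2 \mid y_1, u)\, f_{D_t \mid Y_t U}(1 \mid y_2, u) \\
&\qquad \times f_{D_t \mid Y_t U}(1 \mid y_1, u)\, f_{Y_1 U}(y_1, u)\, f_{Z \mid U}(z \mid u)\, d\mu(u), \\
f_{D_2 D_1}(1, 1; M) &= \int f_{Y_t \mid Y_{t-1} U}(y_2 \mid y_1, u)\, f_{D_t \mid Y_t U}(1 \mid y_2, u)\, f_{D_t \mid Y_t U}(1 \mid y_1, u)\, f_{Y_1 U}(y_1, u)\, d\mu(y_2, y_1, u)
\end{align*}
for any model $M := (F_{Y_t \mid Y_{t-1} U}, F_{D_t \mid Y_t U}, F_{Y_1 U}, F_{Z \mid U}) \in \mathcal{F}$. Substituting these equalities in (20), we conclude that the true model $(F^*_{Y_t \mid Y_{t-1} U}, F^*_{D_t \mid Y_t U}, F^*_{Y_1 U}, F^*_{Z \mid U})$ is the unique solution to
\begin{align*}
\max_{(F_{Y_t \mid Y_{t-1} U},\, F_{D_t \mid Y_t U},\, F_{Y_1 U},\, F_{Z \mid U}) \in \mathcal{F}}\ 
& c_1 E\left[\log \int f_{Y_t \mid Y_{t-1} U}(Y_2 \mid Y_1, u)\, f_{D_t \mid Y_t U}(1 \mid Y_1, u)\, f_{Y_1 U}(Y_1, u)\, f_{Z \mid U}(Z \mid u)\, d\mu(u) \,\middle|\, D_1 = 1\right] \\
& + c_2 E\Big[\log \int f_{Y_t \mid Y_{t-1} U}(Y_3 \mid Y_2, u)\, f_{Y_t \mid Y_{t-1} U}(Y_2 \mid Y_1, u)\, f_{D_t \mid Y_t U}(1 \mid Y_2, u)\, f_{D_t \mid Y_t U}(1 \mid Y_1, u) \\
& \qquad\qquad \times f_{Y_1 U}(Y_1, u)\, f_{Z \mid U}(Z \mid u)\, d\mu(u) \,\Big|\, D_2 = D_1 = 1\Big]
\end{align*}
subject to
\[
\int f_{D_t \mid Y_t U}(1 \mid y_1, u)\, f_{Y_1 U}(y_1, u)\, d\mu(y_1, u) = f_{D_1}(1)
\]
and
\[
\int f_{Y_t \mid Y_{t-1} U}(y_2 \mid y_1, u)\, f_{D_t \mid Y_t U}(1 \mid y_2, u)\, f_{D_t \mid Y_t U}(1 \mid y_1, u)\, f_{Y_1 U}(y_1, u)\, d\mu(y_2, y_1, u) = f_{D_2 D_1}(1, 1),
\]
as claimed. □

9.2. Remark 9 (Unit Lagrange Multipliers).

As shorthand, we write $f = (f_1, f_2, f_3, f_4) \in \mathcal{F}$ for an element of $\mathcal{F}$, $p_1 := \Pr(D_1 = 1)$, $p_2 := \Pr(D_2 = D_1 = 1)$, and $p := (p_1, p_2)'$.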
The solution $f^*(\,\cdot\,; p) \in \mathcal{F}$ has Lagrange multipliers $\lambda^*(p) = (\lambda_1^*(p), \lambda_2^*(p))' \in \Lambda$ such that $(f^*(\,\cdot\,; p), \lambda^*(p))$ is a saddle point of the Lagrangean functional
\[
L(f, \lambda; p) = p_1 L_1(f; p_1) + p_2 L_2(f; p_2) - \lambda_1 (L_3(f) - p_1) - \lambda_2 (L_4(f) - p_2),
\]
where the functionals $L_1, \dots, L_4$ are defined as
\begin{align*}
L_1(f; p_1) &= \int \log\left[\int f_1(y_2 \mid y_1, u)\, f_2(1 \mid y_1, u)\, f_3(y_1, u)\, f_4(z \mid u)\, d\mu(u)\right] f_{Y_2 Y_1 Z \mid D_1}(y_2, y_1, z \mid 1)\, d\mu(y_2, y_1, z), \\
L_2(f; p_2) &= \int \log\left[\int f_1(y_3 \mid y_2, u)\, f_1(y_2 \mid y_1, u)\, f_2(1 \mid y_2, u)\, f_2(1 \mid y_1, u)\, f_3(y_1, u)\, f_4(z \mid u)\, d\mu(u)\right] \\
&\qquad \times f_{Y_3 Y_2 Y_1 Z \mid D_2 D_1}(y_3, y_2, y_1, z \mid 1, 1)\, d\mu(y_3, y_2, y_1, z), \\
L_3(f) &= \int f_2(1 \mid y_1, u)\, f_3(y_1, u)\, d\mu(y_1, u), \quad\text{and} \\
L_4(f) &= \int f_1(y_2 \mid y_1, u)\, f_2(1 \mid y_2, u)\, f_2(1 \mid y_1, u)\, f_3(y_1, u)\, d\mu(y_2, y_1, u).
\end{align*}
Moreover, $f^*(\,\cdot\,; p)$ maximizes $L(f, \lambda^*(p); p)$ given that $\lambda$ is restricted to $\lambda^*(p)$. We want to claim that $\lambda^*(p) = (1, 1)'$. The following assumptions are imposed to this end.

Assumption (Regularity for Unit Lagrange Multipliers).
(i) Selection probabilities are positive: $p_1, p_2 > 0$.
(ii) The functionals $L_1$, $L_2$, $L_3$, and $L_4$ are Fréchet differentiable with respect to $f$ at the solution $f^*(\,\cdot\,; p)$ for some norm $\|\cdot\|$ on a linear space containing $\mathcal{F}$.
(iii) The solution $(f^*(\,\cdot\,; p), \lambda^*(p))$ is differentiable with respect to $p$.
(iv) The solution $f^*(\,\cdot\,; p)$ is a regular point of the constraint functionals $L_3$ and $L_4$.

A sufficient condition for part (ii) of this assumption will be provided later in terms of a concrete normed linear space.

Proof. Since the Chain Rule holds for a composition of Fréchet differentiable transformations (cf.
Luenberger, 1969; p. 176), we have
\begin{align*}
\frac{d}{dp_1} L(f^*(\,\cdot\,; p), \lambda^*(p); p)
&= D_{f,\lambda} L(f^*(\,\cdot\,; p), \lambda^*(p); p) \cdot D_{p_1}(f^*(\,\cdot\,; p), \lambda^*(p)) + \frac{\partial}{\partial p_1} L(f^*(\,\cdot\,; p), \lambda^*(p); p) \\
&= \frac{\partial}{\partial p_1} L(f^*(\,\cdot\,; p), \lambda^*(p); p),
\end{align*}
where the second equality follows from the equality constraints and the stationarity of $L(\,\cdot\,, \lambda^*(p); p)$ at $f^*(\,\cdot\,; p)$, which is a regular point of the constraint functionals $L_3$ and $L_4$ by assumption. On one hand, the partial derivative is
\[
\frac{\partial}{\partial p_1} L(f^*(\,\cdot\,; p), \lambda^*(p); p) = \lambda_1^*(p).
\]
On the other hand, complementary slackness yields
\[
\frac{d}{dp_1} L(f^*(\,\cdot\,; p), \lambda^*(p); p) = \frac{d}{dp_1}\left[p_1 L_1(f^*(\,\cdot\,; p); p_1)\right].
\]
In order to evaluate the last term, we first note that
\[
p_1 L_1(f; p_1) = \int \log\left[\int f_1(y_2 \mid y_1, u)\, f_2(1 \mid y_1, u)\, f_3(y_1, u)\, f_4(z \mid u)\, d\mu(u)\right] f_{Y_2 Y_1 Z D_1}(y_2, y_1, z, 1)\, d\mu(y_2, y_1, z).
\]
In view of the proof of Corollary 1, we recall that $f^*(\,\cdot\,; p)$ maximizes $p_1 L_1(\,\cdot\,; p_1)$, and the solution $f^*(\,\cdot\,; p)$ satisfies
\[
\int f_1^*(y_2 \mid y_1, u; p)\, f_2^*(1 \mid y_1, u; p)\, f_3^*(y_1, u; p)\, f_4^*(z \mid u; p)\, d\mu(u) = f_{Y_2 Y_1 Z \mid D_1}(y_2, y_1, z \mid 1) \cdot p_1,
\]
where the conditional density $f_{Y_2 Y_1 Z \mid D_1}(\cdot, \cdot, \cdot \mid 1)$ is invariant to variations in $p$, and the scale of the integral varies with $p_1$, which defines the $L^1$-equivalence class of non-negative functions over which the Kullback-Leibler information inequality is satisfied. Therefore, we have
\[
\frac{d}{dp_1} \log\left[\int f_1^*(y_2 \mid y_1, u; p)\, f_2^*(1 \mid y_1, u; p)\, f_3^*(y_1, u; p)\, f_4^*(z \mid u; p)\, d\mu(u)\right] = \frac{1}{p_1}.
\]
It then follows that
\[
\frac{d}{dp_1} L(f^*(\,\cdot\,; p), \lambda^*(p); p) = \frac{d}{dp_1}\left[p_1 L_1(f^*(\,\cdot\,; p); p_1)\right] = \int \frac{1}{p_1}\, f_{Y_2 Y_1 Z D_1}(y_2, y_1, z, 1)\, d\mu(y_2, y_1, z) = 1,
\]
showing that $\lambda_1^*(p) = 1$. Similar lines of argument prove $\lambda_2^*(p) = 1$. □

Part (ii) of the above assumption is ambiguous about the underlying topological spaces, as we did not explicitly define the norm. To remedy this, we consider a sufficient condition here. Write $\mathcal{F} = \mathcal{F}_1 \times \mathcal{F}_2 \times \mathcal{F}_3 \times \mathcal{F}_4$. Define a norm on $\mathcal{F}$ by $\|f\|_s := \|f_1\|_2 + \|f_2\|_2 + \|f_3\|_2 + \|f_4\|_2$, where $\|\cdot\|_2$ denotes the $L^2$-norm.
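Also, define the set $B_j(M) = \{f_j \in \mathcal{F}_j \mid \|f_j\|_\infty \le M\}$ for $M \in (0, \infty)$ for each $j = 1, 2, 3, 4$. The following uniform boundedness and integrability together imply part (ii).

Assumption (A Sufficient Condition for Part (ii)). There exists $M < \infty$ such that $\mathcal{F}_1 \subset L^1 \cap B_1(M)$, $\mathcal{F}_2 \subset L^1 \cap B_2(M)$, $\mathcal{F}_3 \subset L^1 \cap B_3(M)$, and $\mathcal{F}_4 \subset L^1 \cap B_4(M)$ hold with the respective Lebesgue measurable spaces.

Note that $\mathcal{F}_1 \subset L^1 \cap L^\infty$, $\mathcal{F}_2 \subset L^1 \cap L^\infty$, $\mathcal{F}_3 \subset L^1 \cap L^\infty$, and $\mathcal{F}_4 \subset L^1 \cap L^\infty$ follow from this assumption, since $B_j(M) \subset L^\infty$ for each $j = 1, 2, 3, 4$. But then, each of these sets is also square integrable, as $L^1 \cap L^\infty \subset L^2$ (cf. Folland, 1999; p. 185). To see the Fréchet differentiability of $L_1$, observe that for any $\eta \in \mathcal{F}$
\begin{align*}
& \left\| \int (f_1 + \eta_1)(f_2 + \eta_2)(f_3 + \eta_3)(f_4 + \eta_4)\, d\mu(u) - \int f_1 f_2 f_3 f_4\, d\mu(u) - DL_1(f; \eta) \right\|_1 \\
&\le \|f_1 f_2 \eta_3 \eta_4\|_1 + \|f_1 \eta_2 \eta_3 f_4\|_1 + \|\eta_1 \eta_2 f_3 f_4\|_1 + \|f_1 \eta_2 f_3 \eta_4\|_1 + \|\eta_1 f_2 \eta_3 f_4\|_1 + \|\eta_1 f_2 f_3 \eta_4\|_1 \\
&\quad + \|f_1 \eta_2 \eta_3 \eta_4\|_1 + \|\eta_1 f_2 \eta_3 \eta_4\|_1 + \|\eta_1 \eta_2 f_3 \eta_4\|_1 + \|\eta_1 \eta_2 \eta_3 f_4\|_1 + \|\eta_1 \eta_2 \eta_3 \eta_4\|_1 \\
&\le \|f_1\|_\infty \|f_2\|_\infty \|\eta_3\|_2 \|\eta_4\|_2 + \|f_1\|_\infty \|f_4\|_\infty \|\eta_2\|_2 \|\eta_3\|_2 + \|f_3\|_\infty \|f_4\|_\infty \|\eta_1\|_2 \|\eta_2\|_2 \\
&\quad + \|f_1\|_\infty \|f_3\|_\infty \|\eta_2\|_2 \|\eta_4\|_2 + \|f_2\|_\infty \|f_4\|_\infty \|\eta_1\|_2 \|\eta_3\|_2 + \|f_2\|_\infty \|f_3\|_\infty \|\eta_1\|_2 \|\eta_4\|_2 \\
&\quad + \|f_1\|_\infty \|\eta_2\|_\infty \|\eta_3\|_2 \|\eta_4\|_2 + \|\eta_1\|_\infty \|f_2\|_\infty \|\eta_3\|_2 \|\eta_4\|_2 + \|\eta_1\|_\infty \|f_3\|_\infty \|\eta_2\|_2 \|\eta_4\|_2 \\
&\quad + \|\eta_1\|_\infty \|f_4\|_\infty \|\eta_2\|_2 \|\eta_3\|_2 + \|\eta_1\|_\infty \|\eta_2\|_\infty \|\eta_3\|_2 \|\eta_4\|_2 \\
&\le \big( \|f_1\|_\infty \|f_2\|_\infty + \|f_1\|_\infty \|f_4\|_\infty + \|f_3\|_\infty \|f_4\|_\infty + \|f_1\|_\infty \|f_3\|_\infty + \|f_2\|_\infty \|f_4\|_\infty + \|f_2\|_\infty \|f_3\|_\infty \\
&\qquad + \|f_1\|_\infty \|\eta_2\|_\infty + \|\eta_1\|_\infty \|f_2\|_\infty + \|\eta_1\|_\infty \|f_3\|_\infty + \|\eta_1\|_\infty \|f_4\|_\infty + \|\eta_1\|_\infty \|\eta_2\|_\infty \big) \|\eta\|_s^2 \\
&\le 11 M^2 \|\eta\|_s^2,
\end{align*}
where the $L^1$-norm in the first line is by integration with respect to $(y_2, y_1, z)$, all the remaining $L^p$-norms are by integration with respect to $(y_2, y_1, z, u)$,
\[
DL_1(f; \eta) := \int (f_1 f_2 f_3 \eta_4 + f_1 f_2 \eta_3 f_4 + f_1 \eta_2 f_3 f_4 + \eta_1 f_2 f_3 f_4)\, d\mu(u),
\]
the first inequality follows from the triangle inequality, the second inequality follows from Hölder's inequality, the third inequality follows from our definition of the norm on $\mathcal{F}$, and the last inequality follows from our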
assumption. But then,
\[
\lim_{\|\eta\|_s \to 0} \frac{\left\| \int (f_1 + \eta_1)(f_2 + \eta_2)(f_3 + \eta_3)(f_4 + \eta_4)\, d\mu(u) - \int f_1 f_2 f_3 f_4\, d\mu(u) - DL_1(f; \eta) \right\|_1}{\|\eta\|_s} = 0,
\]
showing that $DL_1(f; \eta)$ is the Fréchet derivative of the operator $f \mapsto \int f_1 f_2 f_3 f_4\, d\mu(u)$. This in turn implies Fréchet differentiability of the functional $L_1$ at the solution $f^*(\,\cdot\,; p)$, since the functional $L^1 \ni \eta \mapsto \int \log \eta\, dF_{Y_2 Y_1 Z \mid D_1 = 1}$ is Fréchet differentiable at $\eta = f_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1)$. Similar lines of argument show Fréchet differentiability of the other functionals $L_2$, $L_3$, and $L_4$ at $f^*(\,\cdot\,; p)$.

9.3. Proposition 1 (Consistency of the Nonparametric Estimator).

As a setup, we define a normed linear space $(\mathcal{L}, \|\cdot\|)$ containing the model set $\mathcal{F} = \mathcal{F}_1 \times \mathcal{F}_2 \times \mathcal{F}_3 \times \mathcal{F}_4$ as follows. We define the uniform norm of $f$ as the essential supremum
\[
\|f\|_\infty = \operatorname{ess\,sup}_x |f(x)|.
\]
Following Newey and Powell (2003) and others, we also define the following version of the uniform norm when characterizing compactness:
\[
\|f\|_{R,\infty} = \operatorname{ess\,sup}_x |f(x)(1 + x'x)|.
\]
Note that $\|\cdot\|_\infty \le \|\cdot\|_{R,\infty}$ holds. Similarly, define the corresponding version of the $L^1$ norm
\[
\|f\|_{R,1} = \int |f(x)|\, (1 + x'x)\, dx.
\]
Define a norm $\|\cdot\|$ on a linear space containing $\mathcal{F}$ by
\[
\|f\| := \|f_1\|_{R,\infty} + \|f_2\|_{R,\infty} + \|f_3\|_{R,\infty} + \|f_4\|_{R,\infty}.
\]
We consider $\mathcal{F}$ with the subspace topology of this normed linear space, where Assumption 3 below imposes restrictions on how to appropriately choose such a subset $\mathcal{F}$. We assume that the data is i.i.d.

Assumption 1 (Data). The data $\{(Y_{i3}, Y_{i2}, Y_{i1}, Z_i, D_{i2}, D_{i1})\}_{i=1}^n$ is i.i.d.

In order to model the rate at which the complexity of the sieve spaces evolves with sample size $n$, we introduce the notation $N(\,\cdot\,, \,\cdot\,, \|\cdot\|)$ for the covering numbers without bracketing. Let $B(f, \varepsilon) = \{f' \in \mathcal{F} \mid \|f - f'\| < \varepsilon\}$ denote the $\varepsilon$-ball around $f \in \mathcal{F}$ with respect to the norm $\|\cdot\|$ defined above. For each $\varepsilon > 0$ and $n$, let $N(\varepsilon, \mathcal{F}_{k(n)}, \|\cdot\|)$ denote the minimum number of such $\varepsilon$-balls covering $\mathcal{F}_{k(n)}$, i.e., $\min\{|C| \mid \cup_{f \in C} B(f, \varepsilon) \supset \mathcal{F}_{k(n)}\}$. With this notation, we assume the following restriction.
Assumption 2 (Sieve Spaces).
(i) $\{\mathcal{F}_{k(n)}\}_{n=1}^\infty$ is an increasing sequence, $\mathcal{F}_{k(n)} \subset \mathcal{F}$ for each $n$, and there exists a sequence $\{\pi_{k(n)} f_0\}_{n=1}^\infty$ such that $\pi_{k(n)} f_0 \in \mathcal{F}_{k(n)}$ for each $n$.
(ii) $\log N(\varepsilon, \mathcal{F}_{k(n)}, \|\cdot\|) = o(n)$ for all $\varepsilon > 0$.

The next assumption facilitates compactness of the model set and Hölder continuity of the objective functional, both of which are important for nice large sample behavior of the estimator. We assume that the true model $f_0$ belongs to $\mathcal{F}$ satisfying the following.

Assumption 3 (Model Set).
(i) $L^1$ Compactness: Each of $\mathcal{F}_2$ and $\mathcal{F}_3$ is compact with respect to $\|\cdot\|_{R,1}$. Thus, let $M < \infty$ be a number such that $\sup_{f_i \in \mathcal{F}_i} \|f_i\|_{R,1} \le M$ for each $i = 2, 3$.
(ii) $L^\infty$ Compactness: Each of $\mathcal{F}_1$, $\mathcal{F}_2$, $\mathcal{F}_3$, and $\mathcal{F}_4$ is compact with respect to $\|\cdot\|_{R,\infty}$. Thus, let $M_\infty < \infty$ be a number such that $\sup_{f_i \in \mathcal{F}_i} \|f_i\|_{R,\infty} \le M_\infty$ for each $i = 1, 2, 3, 4$.
(iii) Uniformly Bounded Density of $E$: There exists $M_1 < \infty$ such that
\[
\sup_{f_1 \in \mathcal{F}_1} \sup_{y_2, y_1} \int |f_1(y_2 \mid y_1, u)|\, du \le M_1.
\]
(iv) Uniformly Bounded Density of $Y_1$: There exists $M_3 < \infty$ such that
\[
\sup_{f_3 \in \mathcal{F}_3} \sup_{y_1} \int |f_3(y_1, u)|\, du \le M_3.
\]
(v) Bounded Objective:
\[
E\left[\left(\inf_{f \in \mathcal{F}} \int f_1(Y_2 \mid Y_1, u)\, f_2(1 \mid Y_1, u)\, f_3(Y_1, u)\, f_4(Z \mid u)\, d\mu(u)\right)^{-1}\right] < \infty
\]
and
\[
E\left[\left(\inf_{f \in \mathcal{F}} \int f_1(Y_3 \mid Y_2, u)\, f_1(Y_2 \mid Y_1, u)\, f_2(1 \mid Y_2, u)\, f_2(1 \mid Y_1, u)\, f_3(Y_1, u)\, f_4(Z \mid u)\, d\mu(u)\right)^{-1}\right] < \infty.
\]

Part (i) of this assumption is not redundant, since the $f_i$ are not densities, but conditional densities. Despite their appearance, parts (iii) and (iv) of this assumption are not so stringent. Suppose, for example, that the true model $f_0$ consists of the traditional additively separable dynamic model $Y_t = \alpha Y_{t-1} + U + E_t$ with a uniformly bounded density of $E_t$. In this case, the true model $f_0$ can indeed reside in an $\mathcal{F}$ satisfying the restriction of part (iii) for a suitable choice of $M_1$. Similarly, the true model $f_0$ can reside in an $\mathcal{F}$ satisfying the restriction of part (iv) for a suitable choice of $M_3$, whenever the density of $Y_1$ is uniformly bounded.

Proof.
We show the consistency claim of Proposition 1 by showing that Conditions 3.1, 3.2, 3.4, and 3.5M of Chen (2007) are satisfied by our assumptions (Restrictions 1, 2, 3, and 4 and Assumptions 1, 2, and 3). Restrictions 1, 2, 3, and 4 imply her Condition 3.1 by our identification result yielding Corollary 1 together with Remark 9. Her Condition 3.2 is directly assumed by our Assumption 2 (i). Her Condition 3.4 is implied by our Assumption 3 (ii) combined with Tychonoff's Theorem. Her Conditions 3.5M (i) and (iii) are directly assumed by our Assumptions 1 and 2 (ii), respectively, provided that we prove her Condition 3.5M (ii) with $s = 1$. It remains to prove Hölder continuity of $l(y_3, y_2, y_1, z, d_2, d_1; \,\cdot\,) : (\mathcal{F}, \|\cdot\|) \to \mathbb{R}$ for each $(y_3, y_2, y_1, z, d_2, d_1)$, which in turn implies her Condition 3.5M (ii).

In order to show Hölder continuity of the functional $l(y_3, y_2, y_1, z, d_2, d_1; \,\cdot\,)$, it suffices to prove that of $l_1(y_2, y_1, z; \,\cdot\,)$, $l_2(y_3, y_2, y_1, z; \,\cdot\,)$, $l_3$, and $l_4$. First, consider $l_1(y_2, y_1, z; \,\cdot\,)$. For a fixed $(y_2, y_1, z)$, observe
\begin{align*}
& \left| \exp(l_1(y_2, y_1, z; f)) - \exp(l_1(y_2, y_1, z; \bar f)) \right| \\
&\le \left| \int (f_1 - \bar f_1) f_2 f_3 f_4\, du \right| + \left| \int \bar f_1 (f_2 - \bar f_2) f_3 f_4\, du \right| + \left| \int \bar f_1 \bar f_2 (f_3 - \bar f_3) f_4\, du \right| + \left| \int \bar f_1 \bar f_2 \bar f_3 (f_4 - \bar f_4)\, du \right| \\
&\le \|f_1 - \bar f_1\|_\infty \|f_2\|_\infty \int |f_3|\, du\, \|f_4\|_\infty + \|\bar f_1\|_\infty \|f_2 - \bar f_2\|_\infty \int |f_3|\, du\, \|f_4\|_\infty \\
&\quad + \int |\bar f_1|\, du\, \|\bar f_2\|_\infty \|f_3 - \bar f_3\|_\infty \|f_4\|_\infty + \int |\bar f_1|\, du\, \|\bar f_2\|_\infty \|\bar f_3\|_\infty \|f_4 - \bar f_4\|_\infty \\
&\le 2 M_\infty^2 (M_1 + M_3) \|f - \bar f\|,
\end{align*}
where the first inequality follows from the triangle inequality, the second inequality follows from Hölder's inequality, and the third inequality uses Assumption 3 (ii), (iii), and (iv), together with the fact that $\|\cdot\|_\infty \le \|\cdot\|_{R,\infty}$. By Assumption 3 (v), there exists a function $\kappa_1$ such that $E[\kappa_1(Y_2, Y_1, Z)] < \infty$ and
\[
\left| l_1(y_2, y_1, z; f) - l_1(y_2, y_1, z; \bar f) \right| \le 2 M_\infty^2 (M_1 + M_3) \|f - \bar f\|\, \kappa_1(y_2, y_1, z).
\]
This shows Hölder (in particular Lipschitz) continuity of the functional $l_1(y_2, y_1, z; \,\cdot\,)$.
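The passage from the bound on $\exp(l_1)$ to the bound on $l_1$ itself rests on an elementary inequality for the logarithm. The sketch below records that step; the particular choice of $\kappa_1$ shown is one natural candidate suggested by Assumption 3 (v) and is an illustrative assumption, not necessarily the function intended in the text.

```latex
% For a, b > 0, the mean value theorem applied to the logarithm gives
\[
|\log a - \log b| = \left| \int_b^a \frac{dt}{t} \right| \le \frac{|a - b|}{\min\{a, b\}} .
\]
% Take a = \exp(l_1(y_2, y_1, z; f)) and b = \exp(l_1(y_2, y_1, z; \bar f)), i.e. the
% inner integrals \int f_1 f_2 f_3 f_4 \, d\mu(u) evaluated at f and \bar f. Both are
% bounded below by the infimum over the model set, so one admissible (assumed) choice is
\[
\kappa_1(y_2, y_1, z) = \left( \inf_{f \in \mathcal{F}} \int f_1(y_2 \mid y_1, u)\, f_2(1 \mid y_1, u)\,
f_3(y_1, u)\, f_4(z \mid u)\, d\mu(u) \right)^{-1},
\]
% whose expectation is finite by the first part of Assumption 3 (v). Combining,
\[
|l_1(y_2, y_1, z; f) - l_1(y_2, y_1, z; \bar f)|
\le \kappa_1(y_2, y_1, z)\, \bigl| \exp(l_1(y_2, y_1, z; f)) - \exp(l_1(y_2, y_1, z; \bar f)) \bigr| .
\]
```

The same argument applies to $l_2$ with the second part of Assumption 3 (v).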
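By similar calculations using Assumption 3 (ii) and (iii), we obtain
\begin{align*}
& \left| \exp(l_2(y_3, y_2, y_1, z; f)) - \exp(l_2(y_3, y_2, y_1, z; \bar f)) \right| \\
&\le \|f_1 - \bar f_1\|_\infty \int |f_1|\, du\, \|f_2\|_\infty^2 \|f_3\|_\infty \|f_4\|_\infty + \int |\bar f_1|\, du\, \|f_1 - \bar f_1\|_\infty \|f_2\|_\infty^2 \|f_3\|_\infty \|f_4\|_\infty \\
&\quad + \int |\bar f_1|\, du\, \|\bar f_1\|_\infty \|f_2 - \bar f_2\|_\infty \|f_2\|_\infty \|f_3\|_\infty \|f_4\|_\infty \\
&\quad + \int |\bar f_1|\, du\, \|\bar f_1\|_\infty \|\bar f_2\|_\infty \|f_2 - \bar f_2\|_\infty \|f_3\|_\infty \|f_4\|_\infty \\
&\quad + \int |\bar f_1|\, du\, \|\bar f_1\|_\infty \|\bar f_2\|_\infty^2 \|f_3 - \bar f_3\|_\infty \|f_4\|_\infty + \int |\bar f_1|\, du\, \|\bar f_1\|_\infty \|\bar f_2\|_\infty^2 \|\bar f_3\|_\infty \|f_4 - \bar f_4\|_\infty \\
&\le 6 M_\infty^4 M_1 \|f - \bar f\|.
\end{align*}
By Assumption 3 (v), there exists a function $\kappa_2$ such that $E[\kappa_2(Y_3, Y_2, Y_1, Z)] < \infty$ and
\[
\left| l_2(y_3, y_2, y_1, z; f) - l_2(y_3, y_2, y_1, z; \bar f) \right| \le 6 M_\infty^4 M_1 \|f - \bar f\|\, \kappa_2(y_3, y_2, y_1, z).
\]
This shows Lipschitz continuity of the functional $l_2(y_3, y_2, y_1, z; \,\cdot\,)$. Next, using Assumption 3 (i) yields
\[
\left| l_3(f) - l_3(\bar f) \right| \le \int \left| (f_2 - \bar f_2) f_3 \right| + \int \left| \bar f_2 (f_3 - \bar f_3) \right| \le \|f_2 - \bar f_2\|_\infty \|f_3\|_1 + \|\bar f_2\|_1 \|f_3 - \bar f_3\|_\infty \le 2M \|f - \bar f\|.
\]
Similarly, using Assumption 3 (i) and (ii) yields
\begin{align*}
\left| l_4(f) - l_4(\bar f) \right|
&\le \|f_1 - \bar f_1\|_\infty \|f_2\|_\infty^2 \|f_3\|_1 + \|\bar f_1\|_\infty \|f_2 - \bar f_2\|_\infty \|f_2\|_\infty \|f_3\|_1 \\
&\quad + \|\bar f_1\|_\infty \|\bar f_2\|_\infty \|f_2 - \bar f_2\|_\infty \|f_3\|_1 + \|\bar f_1\|_\infty \|\bar f_2\|_\infty \|f_2\|_1 \|f_3 - \bar f_3\|_\infty \\
&\le 4 M M_\infty^2 \|f - \bar f\|.
\end{align*}
It follows that $l_3$ and $l_4$ are also Lipschitz continuous. These results in particular imply Hölder continuity of the functionals $l_1(y_2, y_1, z; \,\cdot\,)$, $l_2(y_3, y_2, y_1, z; \,\cdot\,)$, $l_3$, and $l_4$, hence of $l(y_3, y_2, y_1, z, d_2, d_1; \,\cdot\,)$. Therefore, Chen's Condition 3.5M (ii) is satisfied with $s = 1$ by our assumptions. □

9.4. Semiparametric Estimation.

Section 5.2 proposed an estimator which treats the quadruple $(f_{Y_t \mid Y_{t-1} U}, f_{D_t \mid Y_t U}, f_{Y_1 U}, f_{Z \mid U})$ of density functions nonparametrically. In practice, it may be more useful to specify one or more of these densities semiparametrically. For example, the dynamic model $g$ is conventionally specified by $g(y, u, \varepsilon) = \alpha y + u + \varepsilon$. Denoting the nonparametric density function of $E_t$ by $f_E$, we can represent the density $f_{Y_t \mid Y_{t-1} U}$ by $f_{Y_t \mid Y_{t-1} U}(y' \mid y, u) = f_E(y' - \alpha y - u)$. Consequently, a model is represented by $(\alpha, f_E, f_{D_t \mid Y_t U}, f_{Y_1 U}, f_{Z \mid U})$.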
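The representation of the transition density through $f_E$ can be checked in two lines. The sketch below uses the additively separable specification above and the independence $E_t \perp\!\!\!\perp (Y_{t-1}, U)$ established in the proof of Lemma 4; the Gaussian case at the end is a purely hypothetical illustration, not a specification used in the text.

```latex
% Conditional distribution function under g(y, u, e) = \alpha y + u + e:
\begin{align*}
F_{Y_t \mid Y_{t-1} U}(y' \mid y, u)
  &= \Pr[\alpha y + u + E_t \le y' \mid Y_{t-1} = y, U = u] \\
  &= \Pr[E_t \le y' - \alpha y - u] = F_E(y' - \alpha y - u).
\end{align*}
% Differentiating with respect to y' (the Jacobian of the shift is one) gives
\[
f_{Y_t \mid Y_{t-1} U}(y' \mid y, u) = f_E(y' - \alpha y - u).
\]
% Hypothetical special case: if E_t were standard normal, the transition density would be
\[
f_{Y_t \mid Y_{t-1} U}(y' \mid y, u) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\tfrac{1}{2} (y' - \alpha y - u)^2 \right).
\]
```

In the semiparametric estimator, $f_E$ is left unrestricted and only the scalar $\alpha$ is parametric.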
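For ease of writing, let this model be denoted by $\theta = (\alpha, \tilde f_1, f_2, f_3, f_4)$. Accordingly, write the set of such models as $\Theta = A \times \tilde{\mathcal{F}}_1 \times \mathcal{F}_2 \times \mathcal{F}_3 \times \mathcal{F}_4$.

With this notation, Corollary 1 and Remark 9 characterize a sieve semiparametric estimator $\hat\theta$ of $\theta_0$ as the solution to
\[
\max_{\theta \in \Theta_{k(n)}} \frac{1}{n} \sum_{i=1}^n l(Y_{i3}, Y_{i2}, Y_{i1}, Z_i, D_{i2}, D_{i1}; \theta)
\]
for some sieve space $\Theta_{k(n)} = A_{k(n)} \times \tilde{\mathcal{F}}_{1,k_1(n)} \times \mathcal{F}_{2,k_2(n)} \times \mathcal{F}_{3,k_3(n)} \times \mathcal{F}_{4,k_4(n)} \subset \Theta$, where
\begin{align*}
l(Y_{i3}, Y_{i2}, Y_{i1}, Z_i, D_{i2}, D_{i1}; \theta) &:= 1\{D_{i1} = 1\} \cdot l_1(Y_{i2}, Y_{i1}, Z_i; \theta) + 1\{D_{i2} = D_{i1} = 1\} \cdot l_2(Y_{i3}, Y_{i2}, Y_{i1}, Z_i; \theta) - l_3(\theta) - l_4(\theta), \\
l_1(Y_{i2}, Y_{i1}, Z_i; \theta) &:= \log \int \tilde f_1(Y_{i2} - \alpha Y_{i1} - u)\, f_2(1 \mid Y_{i1}, u)\, f_3(Y_{i1}, u)\, f_4(Z_i \mid u)\, d\mu(u), \\
l_2(Y_{i3}, Y_{i2}, Y_{i1}, Z_i; \theta) &:= \log \int \tilde f_1(Y_{i3} - \alpha Y_{i2} - u)\, \tilde f_1(Y_{i2} - \alpha Y_{i1} - u)\, f_2(1 \mid Y_{i2}, u)\, f_2(1 \mid Y_{i1}, u)\, f_3(Y_{i1}, u)\, f_4(Z_i \mid u)\, d\mu(u), \\
l_3(\theta) &:= \int f_2(1 \mid y_1, u)\, f_3(y_1, u)\, d\mu(y_1, u), \quad\text{and} \\
l_4(\theta) &:= \int \tilde f_1(y_2 - \alpha y_1 - u)\, f_2(1 \mid y_2, u)\, f_2(1 \mid y_1, u)\, f_3(y_1, u)\, d\mu(y_2, y_1, u).
\end{align*}
The asymptotic distribution of $\hat\alpha$ can be derived by following the method of Ai and Chen (2003), which was also used in Blundell, Chen, and Kristensen (2007) and Hu and Schennach (2008). First, I introduce auxiliary notation. Define the path-wise derivative
\[
l'_{\theta_0}(y_3, y_2, y_1, z, d_2, d_1; \theta - \theta_0) = \lim_{r \to 0} \frac{l(y_3, y_2, y_1, z, d_2, d_1; \theta(\theta_0, r)) - l(y_3, y_2, y_1, z, d_2, d_1; \theta_0)}{r},
\]
where $\theta(\theta_0, \,\cdot\,) : \mathbb{R} \to \Theta$ denotes a path such that $\theta(\theta_0, 0) = \theta_0$ and $\theta(\theta_0, 1) = \theta$.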
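To make the path-wise derivative concrete, it can be computed by hand for the simplest component of the criterion. The sketch below assumes a linear path $\theta(\theta_0, r) = \theta_0 + r(\theta - \theta_0)$, which is one admissible choice satisfying $\theta(\theta_0, 0) = \theta_0$ and $\theta(\theta_0, 1) = \theta$, and treats only the term $l_3$; the subscript $0$ on $f_{20}$ and $f_{30}$ marks the components of $\theta_0$.

```latex
% Along the (assumed) linear path, l_3 is a quadratic polynomial in r:
\begin{align*}
l_3(\theta_0 + r(\theta - \theta_0))
  &= \int \bigl(f_{20} + r(f_2 - f_{20})\bigr)\bigl(f_{30} + r(f_3 - f_{30})\bigr)\, d\mu(y_1, u) \\
  &= l_3(\theta_0)
     + r \int \bigl[(f_2 - f_{20})\, f_{30} + f_{20}\, (f_3 - f_{30})\bigr]\, d\mu(y_1, u)
     + r^2 \int (f_2 - f_{20})(f_3 - f_{30})\, d\mu(y_1, u).
\end{align*}
% Subtracting l_3(\theta_0), dividing by r, and letting r -> 0 leaves the linear term:
\[
\lim_{r \to 0} \frac{l_3(\theta_0 + r(\theta - \theta_0)) - l_3(\theta_0)}{r}
  = \int \bigl[(f_2 - f_{20})\, f_{30} + f_{20}\, (f_3 - f_{30})\bigr]\, d\mu(y_1, u).
\]
```

The result is linear in the direction $\theta - \theta_0$, which is the property used when the derivative is later employed to define an inner product.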
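Similarly define the path-wise derivative with respect to each component of $(\tilde f_1, f_2, f_3, f_4)$ by
\begin{align*}
\frac{d}{d\tilde f_1} l_{\theta_0}(y_3, y_2, y_1, z, d_2, d_1; \tilde f_1 - \tilde f_{10}) &= \lim_{r \to 0} \frac{l(y_3, y_2, y_1, z, d_2, d_1; \tilde f_1(\theta_0, r)) - l(y_3, y_2, y_1, z, d_2, d_1; \theta_0)}{r}, \\
\frac{d}{df_2} l_{\theta_0}(y_3, y_2, y_1, z, d_2, d_1; f_2 - f_{20}) &= \lim_{r \to 0} \frac{l(y_3, y_2, y_1, z, d_2, d_1; f_2(\theta_0, r)) - l(y_3, y_2, y_1, z, d_2, d_1; \theta_0)}{r}, \\
\frac{d}{df_3} l_{\theta_0}(y_3, y_2, y_1, z, d_2, d_1; f_3 - f_{30}) &= \lim_{r \to 0} \frac{l(y_3, y_2, y_1, z, d_2, d_1; f_3(\theta_0, r)) - l(y_3, y_2, y_1, z, d_2, d_1; \theta_0)}{r}, \\
\frac{d}{df_4} l_{\theta_0}(y_3, y_2, y_1, z, d_2, d_1; f_4 - f_{40}) &= \lim_{r \to 0} \frac{l(y_3, y_2, y_1, z, d_2, d_1; f_4(\theta_0, r)) - l(y_3, y_2, y_1, z, d_2, d_1; \theta_0)}{r},
\end{align*}
where $\tilde f_1(\theta_0, \,\cdot\,) : \mathbb{R} \to \tilde{\mathcal{F}}_1$, $f_2(\theta_0, \,\cdot\,) : \mathbb{R} \to \mathcal{F}_2$, $f_3(\theta_0, \,\cdot\,) : \mathbb{R} \to \mathcal{F}_3$, and $f_4(\theta_0, \,\cdot\,) : \mathbb{R} \to \mathcal{F}_4$ denote paths as before.

Recenter the set of parameters by $\Omega = \Theta - \theta_0$ so that
\[
\langle v_1, v_2 \rangle = E\left[ l'_{\theta_0}(Y_3, Y_2, Y_1, Z, D_2, D_1; v_1)\, l'_{\theta_0}(Y_3, Y_2, Y_1, Z, D_2, D_1; v_2) \right]
\]
defines an inner product on $\Omega$. Furthermore, by taking the closure $\overline{\Omega}$, we obtain a complete space with respect to the topology induced by $\langle \cdot, \cdot \rangle$, hence a Hilbert space $(\overline{\Omega}, \langle \cdot, \cdot \rangle)$. It can be written as $\overline{\Omega} = \mathbb{R} \times \overline{W}$, where $\overline{W} = \overline{\tilde{\mathcal{F}}_1 \times \mathcal{F}_2 \times \mathcal{F}_3 \times \mathcal{F}_4 - (\tilde f_{10}, f_{20}, f_{30}, f_{40})}$. Given this notation, define
\begin{align*}
w^* := (\tilde f_1^*, f_2^*, f_3^*, f_4^*) = \arg\min_{w \in \overline{W}} E\Big[ \Big(
& \frac{d}{d\alpha} l(Y_3, Y_2, Y_1, Z, D_2, D_1; \theta_0) - \frac{d}{d\tilde f_1} l_{\theta_0}(Y_3, Y_2, Y_1, Z, D_2, D_1; \tilde f_1) \\
& - \frac{d}{df_2} l_{\theta_0}(Y_3, Y_2, Y_1, Z, D_2, D_1; f_2) - \frac{d}{df_3} l_{\theta_0}(Y_3, Y_2, Y_1, Z, D_2, D_1; f_3) \\
& - \frac{d}{df_4} l_{\theta_0}(Y_3, Y_2, Y_1, Z, D_2, D_1; f_4) \Big)^2 \Big].
\end{align*}
Given this $w^*$, next define
\begin{align*}
\Phi_{w^*}(y_3, y_2, y_1, z, d_2, d_1) :=\ & \frac{d}{d\alpha} l(y_3, y_2, y_1, z, d_2, d_1; \theta_0) - \frac{d}{d\tilde f_1} l_{\theta_0}(y_3, y_2, y_1, z, d_2, d_1; \tilde f_1^*) \\
& - \frac{d}{df_2} l_{\theta_0}(y_3, y_2, y_1, z, d_2, d_1; f_2^*) - \frac{d}{df_3} l_{\theta_0}(y_3, y_2, y_1, z, d_2, d_1; f_3^*) \\
& - \frac{d}{df_4} l_{\theta_0}(y_3, y_2, y_1, z, d_2, d_1; f_4^*).
\end{align*}
The next assumption sets a moment condition.

Assumption 4 (Bounded Second Moment).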
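$\sigma := E\left[\Phi_{w^*}(Y_3, Y_2, Y_1, Z, D_2, D_1)^2\right] < \infty$.

The mapping $s : \theta - \theta_0 \mapsto \alpha - \alpha_0$ is a linear functional on $\overline{\Omega}$. Since $(\overline{\Omega}, \langle \cdot, \cdot \rangle)$ is a Hilbert space, the Riesz Representation Theorem guarantees the existence of $v^* \in \overline{\Omega}$ such that $s(\theta - \theta_0) = \langle v^*, \theta - \theta_0 \rangle$ for all $\theta \in \Theta$ under Assumption 4. Moreover, this representing vector has the explicit formula $v^* = (\sigma^{-1}, -\sigma^{-1} w^*)$. Using Corollary 1 of Shen (1997) yields the asymptotic distribution of $\sqrt{N}(\hat\alpha - \alpha_0) = \sqrt{N} \langle v^*, \hat\theta - \theta_0 \rangle$, which is $N(0, \sigma^{-1})$.

In order to invoke Shen's corollary, some additional notation needs to be introduced. The remainder of the linear approximation is
\[
r(y_3, y_2, y_1, z, d_2, d_1; \theta - \theta_0) := l(y_3, y_2, y_1, z, d_2, d_1; \theta) - l(y_3, y_2, y_1, z, d_2, d_1; \theta_0) - l'_{\theta_0}(y_3, y_2, y_1, z, d_2, d_1; \theta - \theta_0).
\]
A divergence measure is defined by
\[
K(\theta_0, \theta) := \frac{1}{N} \sum_{i=1}^N E\left[ l(Y_{i3}, Y_{i2}, Y_{i1}, Z_i, D_{i2}, D_{i1}; \theta_0) - l(Y_{i3}, Y_{i2}, Y_{i1}, Z_i, D_{i2}, D_{i1}; \theta) \right].
\]
Denote the empirical process induced by $g$ by
\[
\nu_n(g) := \frac{1}{\sqrt{N}} \sum_{i=1}^N \left( g(Y_{i3}, Y_{i2}, Y_{i1}, Z_i, D_{i2}, D_{i1}) - E g(Y_{i3}, Y_{i2}, Y_{i1}, Z_i, D_{i2}, D_{i1}) \right).
\]
For a perturbation $\epsilon_n$ such that $\epsilon_n = o(n^{-1/2})$, let $\theta^*(\theta, \epsilon_n) = (1 - \epsilon_n)\theta + \epsilon_n (u^* + \theta_0)$, where $u^* = \pm v^*$. Lastly, let $P_n$ denote the projection $\Theta \to \Theta_n$. The following high-level assumptions of Shen (1997) guarantee asymptotic normality of $\sqrt{N} \langle v^*, \hat\theta - \theta_0 \rangle$, or equivalently of $\sqrt{N}(\hat\alpha - \alpha_0)$.

Assumption 5 (Regularity).
(i) $\sup_{\{\theta \in \Theta_n \mid \|\theta - \theta_0\| \le \delta_0\}} n^{-1/2} \nu_n\big( r(Y_3, Y_2, Y_1, Z, D_2, D_1; \theta - \theta_0) - r(Y_3, Y_2, Y_1, Z, D_2, D_1; P_n(\theta^*(\theta, \epsilon_n)) - \theta_0) \big) = O_p(\epsilon_n^2)$.
(ii) $\sup_{\{\theta \in \Theta_n \mid 0 < \|\theta - \theta_0\| \le \delta_n\}} \big[ K(\theta_0, P_n(\theta^*(\theta, \epsilon_n))) - K(\theta_0, \theta) \big] - \tfrac{1}{2} \big[ \|\theta^*(\theta, \epsilon_n) - \theta_0\|^2 - \|\theta - \theta_0\|^2 \big] = O(\epsilon_n^2)$.
(iii) $\sup_{\{\theta \in \Theta_n \mid 0 < \|\theta - \theta_0\| \le \delta_n\}} \|\theta^*(\theta, \epsilon_n) - P_n(\theta^*(\theta, \epsilon_n))\| = O(\delta_n^{-1} \epsilon_n^2)$.
(iv) $\sup_{\{\theta \in \Theta_n \mid \|\theta - \theta_0\| \le \delta_n\}} n^{-1/2} \nu_n\big( l'_{\theta_0}(\cdots; \theta^*(\theta, \epsilon_n) - P_n(\theta^*(\theta, \epsilon_n))) \big) = O_p(\epsilon_n^2)$.
(v) $\sup_{\{\theta \in \Theta_n \mid \|\theta - \theta_0\| \le \delta_n\}} n^{-1/2} \nu_n\big( l'_{\theta_0}(\cdots; \theta - \theta_0) \big) = O_p(\epsilon_n)$.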
Proposition 2 (Asymptotic Distribution of a Semiparametric Estimator). Suppose that Restrictions 1, 2, 3, and 4 and Assumptions 4 and 5 hold. Then, $\sqrt{N}(\hat\alpha - \alpha_0) \xrightarrow{d} N(0, \sigma^{-1})$.

10. Appendix: Special Cases and Generalizations of the Baseline Model

10.1. A Variety of Missing Observations.

While the baseline model considered in the paper induces a permanent dropout from data by a hazard selection, variants of the model can be encompassed as special cases under which the main identification result continues to hold. Specifically, we consider the following Classes 1 and 2 as special models of Class 3.

Class 1 (Nonseparable Dynamic Panel Data Model).
\[
\begin{cases}
Y_t = g(Y_{t-1}, U, E_t), \quad t = 2, \dots, T & \text{(State Dynamics)} \\
F_{Y_1 U} & \text{(Initial joint distribution of $(Y_1, U)$)} \\
Z = \zeta(U, W) & \text{(Optional: nonclassical proxy of $U$)}
\end{cases}
\]

Class 2 (Nonseparable Dynamic Panel Data Model with Missing Observations).
\[
\begin{cases}
Y_t = g(Y_{t-1}, U, E_t), \quad t = 2, \dots, T & \text{(State Dynamics)} \\
D_t = h(Y_t, U, V_t), \quad t = 1, \dots, T - 1 & \text{(Selection)} \\
F_{Y_1 U} & \text{(Initial joint distribution of $(Y_1, U)$)} \\
Z = \zeta(U, W) & \text{(Optional: nonclassical proxy of $U$)}
\end{cases}
\]
where $Y_t$ is censored by the binary indicator $D_t$ of sample selection as follows:
\[
\begin{cases}
Y_t \text{ is observed} & \text{if } D_{t-1} = 1 \text{ or } t = 1, \\
Y_t \text{ is unobserved} & \text{if } D_{t-1} = 0 \text{ and } t > 1.
\end{cases}
\]
A representative example of this instantaneous selection is the Roy model, such as $h(y, u, v) = 1\{E[\pi(g(y, u, E_{t+1}), u)] > c(u, v)\}$, where $\pi$ measures payoffs and $c$ measures costs.

The following is the baseline model considered in the paper.

Class 3 (Nonseparable Dynamic Panel Data Model with Hazards).
\[
\begin{cases}
Y_t = g(Y_{t-1}, U, E_t), \quad t = 2, \dots, T & \text{(State Dynamics)} \\
D_t = h(Y_t, U, V_t), \quad t = 1, \dots, T - 1 & \text{(Hazard Model)} \\
F_{Y_1 U} & \text{(Initial joint distribution of $(Y_1, U)$)} \\
Z = \zeta(U, W) & \text{(Optional: nonclassical proxy of $U$)}
\end{cases}
\]
where $D_t = 0$ induces a hazard of permanent dropout in the following manner:
\[
\begin{cases}
Y_1 \text{ is observed}, \\
Y_2 \text{ is observed if } D_1 = 1, \\
Y_3 \text{ is observed if } D_1 = D_2 = 1.
\end{cases}
\]
The present appendix section proves that identification of Class 3 implies identification of Classes 1 and 2. The observable parts of the joint distributions in each of the three classes include (but are not limited to) the following:

Class 1: Observe $F_{Y_3 Y_2 Y_1 Z D_2 D_1}$, $F_{Y_2 Y_1 Z D_1}$, $F_{Y_3 Y_1 Z D_2}$, and $F_{Y_3 Y_2 Z D_2}$.
Class 2: Observe $F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1)$, $F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1)$, and $F_{Y_3 Y_1 Z D_2}(\cdot, \cdot, \cdot, 1)$.
Class 3: Observe $F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1)$ and $F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1)$.

Since the selection variable $D_t$ in Class 1 is not defined, we assume without loss of generality that it is degenerate at $D_t = 1$ in Class 1.

The problem of identification under each class can be characterized by the well-definition of the following maps:

Class 1: $(F_{Y_3 Y_2 Y_1 Z D_2 D_1}, F_{Y_2 Y_1 Z D_1}, F_{Y_3 Y_1 Z D_2}, F_{Y_3 Y_2 Z D_2}) \overset{\iota_1}{\mapsto} (g, F_{Y_1 U}, \zeta)$
Class 2: $(F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1), F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1), F_{Y_3 Y_1 Z D_2}(\cdot, \cdot, \cdot, 1)) \overset{\iota_2}{\mapsto} (g, h, F_{Y_1 U}, \zeta)$
Class 3: $(F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1), F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1)) \overset{\iota_3}{\mapsto} (g, h, F_{Y_1 U}, \zeta)$

The main identification result of this paper was to show well-definition of the map $\iota_3$. Therefore, in order to argue that identification of Class 3 implies identification of Classes 1 and 2, it suffices to claim that the well-definition of $\iota_3$ implies well-definition of the maps $\iota_1$ and $\iota_2$.
First, note that the trivial projections
\[
(F_{Y_3 Y_2 Y_1 Z D_2 D_1}, F_{Y_2 Y_1 Z D_1}, F_{Y_3 Y_1 Z D_2}, F_{Y_3 Y_2 Z D_2}) \overset{\pi_1}{\mapsto} (F_{Y_3 Y_2 Y_1 Z D_2 D_1}, F_{Y_2 Y_1 Z D_1}, F_{Y_3 Y_1 Z D_2})
\]
and
\[
(F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1), F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1), F_{Y_3 Y_1 Z D_2}(\cdot, \cdot, \cdot, 1)) \overset{\pi_2}{\mapsto} (F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1), F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1))
\]
are well-defined. Second, by the construction of the degenerate random variable $D_t = 1$ in Class 1, the map
\[
(F_{Y_3 Y_2 Y_1 Z D_2 D_1}, F_{Y_2 Y_1 Z D_1}, F_{Y_3 Y_1 Z D_2}, F_{Y_3 Y_2 Z D_2}) \overset{\kappa_1}{\mapsto} (F_{Y_3 Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1), F_{Y_2 Y_1 Z D_1}(\cdot, \cdot, \cdot, 1), F_{Y_3 Y_1 Z D_2}(\cdot, \cdot, \cdot, 1))
\]
is well-defined in Class 1. Third, the trivial projection
\[
(g, h, F_{Y_1 U}, \zeta) \overset{\rho}{\mapsto} (g, F_{Y_1 U}, \zeta)
\]
is well-defined. Now, notice that $\iota_1 = \rho \circ \iota_3 \circ \kappa_1 \circ \pi_1$ in Class 1, and $\iota_2 = \iota_3 \circ \pi_2$ in Class 2. Therefore, the well-definition of $\iota_3$ implies well-definition of $\iota_1$ and $\iota_2$ in particular. Hence, identification of Class 3 implies identification of Classes 1 and 2.

10.2. Identification without a Nonclassical Proxy Variable.

The main result of this paper assumed use of a nonclassical proxy variable $Z$. However, this use was mentioned to be optional, and one can substitute a slightly longer panel with $T = 6$ for the use of a proxy variable. In this section we show how the model $(g, h, F_{Y_1 U})$ can be identified from the joint distribution $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, \cdot, \cdot, 1, 1, 1, 1, 1)$ that follows from $T = 6$ time periods of unbalanced panel data without additional information $Z$.

Restriction 5 (Independence). (i) Exogeneity of Et : Et ⊥⊥ (U, Y1 , {Es }s 0 for all u ∈ U. (iv) Initial Heterogeneity: the integral operator $S_y : L^2(F_{Y_t}) \to L^2(F_U)$ defined by $S_y \xi(u) = \int f_{Y_2 Y_1 U D_1}(y, y', u, 1) \cdot \xi(y')\, dy'$ is bounded and invertible.

Lemma 5 (Independence). The following implications hold:
(i) Restriction 5 (i) $\Rightarrow$ $E_6 \perp\!\!\!\perp (U, Y_1, E_2, E_3, E_4, E_5, V_1, V_2, V_3, V_4, V_5)$ $\Rightarrow$ $Y_6 \perp\!\!\!\perp (Y_1, Y_2, Y_3, Y_4, D_1, D_2, D_3, D_4, D_5) \mid (Y_5, U)$.
(ii) Restriction 5 (i) & (ii) $\Rightarrow$ $(E_3, E_4, E_5, V_3, V_4, V_5) \perp\!\!\!\perp (U, Y_1, E_2, V_1, V_2)$ $\Rightarrow$ $(Y_3, Y_4, Y_5, D_3, D_4, D_5) \perp\!\!\!\perp (Y_1, D_1, D_2) \mid (Y_2, U)$.
(iii) Restriction 5 (i) $\Rightarrow$ $E_2 \perp\!\!\!\perp (U, Y_1, V_1)$ $\Rightarrow$ $Y_2 \perp\!\!\!\perp D_1 \mid (Y_1, U)$.
(iv) Restriction 5 (ii) $\Rightarrow$ $V_2 \perp\!\!\!\perp (U, Y_1, E_2, V_1)$ $\Rightarrow$ $D_2 \perp\!\!\!\perp (Y_1, D_1) \mid (Y_2, U)$.

Proof. In order to prove the lemma, we use the following two properties of conditional independence:
CI.1. $A \perp\!\!\!\perp B$ implies $A \perp\!\!\!\perp B \mid \phi(B)$ for any Borel function $\phi$.
CI.2. $A \perp\!\!\!\perp B \mid C$ implies $A \perp\!\!\!\perp \phi(B, C) \mid C$ for any Borel function $\phi$.

(i) First, applying CI.1 to $E_6 \perp\!\!\!\perp (U, Y_1, E_2, E_3, E_4, E_5, V_1, V_2, V_3, V_4, V_5)$ and using the definition of $g$ yield
\[
E_6 \perp\!\!\!\perp (U, Y_1, E_2, E_3, E_4, E_5, V_1, V_2, V_3, V_4, V_5) \mid (Y_5, U).
\]
Next, applying CI.2 to this conditional independence and using the definitions of $g$ and $h$ yield
\[
E_6 \perp\!\!\!\perp (Y_1, Y_2, Y_3, Y_4, D_1, D_2, D_3, D_4, D_5) \mid (Y_5, U).
\]
Applying CI.2 again to this conditional independence and using the definition of $g$ yield
\[
Y_6 \perp\!\!\!\perp (Y_1, Y_2, Y_3, Y_4, D_1, D_2, D_3, D_4, D_5) \mid (Y_5, U).
\]
(ii) First, applying CI.1 to $(E_3, E_4, E_5, V_3, V_4, V_5) \perp\!\!\!\perp (U, Y_1, E_2, V_1, V_2)$ and using the definition of $g$ yield
\[
(E_3, E_4, E_5, V_3, V_4, V_5) \perp\!\!\!\perp (U, Y_1, E_2, V_1, V_2) \mid (Y_2, U).
\]
Next, applying CI.2 to this conditional independence and using the definitions of $g$ and $h$ yield
\[
(E_3, E_4, E_5, V_3, V_4, V_5) \perp\!\!\!\perp (Y_1, D_1, D_2) \mid (Y_2, U).
\]
Applying CI.2 again to this conditional independence and using the definitions of $g$ and $h$ yield
\[
(Y_3, Y_4, Y_5, D_3, D_4, D_5) \perp\!\!\!\perp (Y_1, D_1, D_2) \mid (Y_2, U).
\]
(iii) The proof is the same as that of Lemma 3 (ii).
(iv) The proof is the same as that of Lemma 3 (iii). □

Lemma 6 (Invariant Transition). (i) Under Restrictions 1 and 5 (i), $F_{Y_t \mid Y_{t-1} U}(y' \mid y, u) = F_{Y_{t'} \mid Y_{t'-1} U}(y' \mid y, u)$ for all $y', y, u, t, t'$. (ii) Under Restrictions 1 and 5 (ii), $F_{D_2 \mid Y_2 U}(d \mid y, u) = F_{D_1 \mid Y_1 U}(d \mid y, u)$ for all $d, y, u$.

This lemma can be proved similarly to Lemma 4.

Lemma 7 (Identification).
Under Restrictions 1, 4, 5, & 6, $(F_{Y_3 \mid Y_2 U}, F_{D_2 \mid Y_2 U}, F_{Y_1 U})$ is uniquely determined by $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$ and $F_{Y_2 Y_1 D_1}(\cdot,\cdot,1)$.

Proof. Given fixed $(y_5, y_4, z, y_2)$, define the operators $L_{y_5,y_4,z,y_2} : L^2(F_{Y_t}) \to L^2(F_{Y_t})$, $P_{y_5} : L^2(F_U) \to L^2(F_{Y_t})$, $Q_{y_5,y_4,z,y_2} : L^2(F_U) \to L^2(F_U)$, $R_{y_2} : L^2(F_U) \to L^2(F_U)$, $S_{y_2} : L^2(F_{Y_t}) \to L^2(F_U)$, $T_{y_2} : L^2(F_{Y_t}) \to L^2(F_U)$, and $T'_{y_2} : L^2(F_U) \to L^2(F_U)$ by
\[
\begin{aligned}
(L_{y_5,y_4,z,y_2}\,\xi)(y_6) &= \int f_{Y_6 Y_5 Y_4 Z Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(y_6, y_5, y_4, z, y_2, y_1, 1, 1, 1, 1, 1) \cdot \xi(y_1)\,dy_1, \\
(P_{y_5}\,\xi)(y_6) &= \int f_{Y_6 \mid Y_5 U}(y_6 \mid y_5, u) \cdot \xi(u)\,du, \\
(Q_{y_5,y_4,z,y_2}\,\xi)(u) &= f_{Y_5 Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_5, y_4, z, 1, 1, 1 \mid y_2, u) \cdot \xi(u), \\
(R_{y_2}\,\xi)(u) &= f_{D_2 \mid Y_2 U}(1 \mid y_2, u) \cdot \xi(u), \\
(S_{y_2}\,\xi)(u) &= \int f_{Y_2 Y_1 U D_1}(y_2, y_1, u, 1) \cdot \xi(y_1)\,dy_1, \\
(T_{y_2}\,\xi)(u) &= \int f_{Y_1 \mid Y_2 U D_2 D_1}(y_1 \mid y_2, u, 1, 1) \cdot \xi(y_1)\,dy_1, \\
(T'_{y_2}\,\xi)(u) &= f_{Y_2 U D_2 D_1}(y_2, u, 1, 1) \cdot \xi(u),
\end{aligned}
\]
respectively.

Step 1: Uniqueness of $F_{Y_6 \mid Y_5 U}$ and $F_{Y_5 Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(\cdot,\cdot,\cdot,1,1,1 \mid \cdot,\cdot)$

The kernel $f_{Y_6 Y_5 Y_4 Z Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot, y_5, y_4, z, y_2, \cdot, 1, 1, 1, 1, 1)$ of the integral operator $L_{y_5,y_4,z,y_2}$ can be rewritten as
\[
\begin{aligned}
& f_{Y_6 Y_5 Y_4 Z Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(y_6, y_5, y_4, z, y_2, y_1, 1, 1, 1, 1, 1) \\
&\quad = \int f_{Y_6 \mid Y_5 Y_4 Z Y_2 Y_1 U D_5 D_4 D_3 D_2 D_1}(y_6 \mid y_5, y_4, z, y_2, y_1, u, 1, 1, 1, 1, 1) \\
&\qquad \times f_{Y_5 Y_4 Z D_5 D_4 D_3 \mid Y_2 Y_1 U D_2 D_1}(y_5, y_4, z, 1, 1, 1 \mid y_2, y_1, u, 1, 1) \\
&\qquad \times f_{D_2 \mid Y_2 Y_1 U D_1}(1 \mid y_2, y_1, u, 1) \cdot f_{Y_2 Y_1 U D_1}(y_2, y_1, u, 1)\,du.
\end{aligned}
\tag{21}
\]
But by Lemma 5 (i), (ii), and (iv), respectively, Restriction 5 implies that
\[
\begin{aligned}
f_{Y_6 \mid Y_5 Y_4 Z Y_2 Y_1 U D_5 D_4 D_3 D_2 D_1}(y_6 \mid y_5, y_4, z, y_2, y_1, u, 1, 1, 1, 1, 1) &= f_{Y_6 \mid Y_5 U}(y_6 \mid y_5, u), \\
f_{Y_5 Y_4 Z D_5 D_4 D_3 \mid Y_2 Y_1 U D_2 D_1}(y_5, y_4, z, 1, 1, 1 \mid y_2, y_1, u, 1, 1) &= f_{Y_5 Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_5, y_4, z, 1, 1, 1 \mid y_2, u), \\
f_{D_2 \mid Y_2 Y_1 U D_1}(1 \mid y_2, y_1, u, 1) &= f_{D_2 \mid Y_2 U}(1 \mid y_2, u).
\end{aligned}
\]
Equation (21) thus can be rewritten as
\[
\begin{aligned}
& f_{Y_6 Y_5 Y_4 Z Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(y_6, y_5, y_4, z, y_2, y_1, 1, 1, 1, 1, 1) \\
&\quad = \int f_{Y_6 \mid Y_5 U}(y_6 \mid y_5, u) \cdot f_{Y_5 Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_5, y_4, z, 1, 1, 1 \mid y_2, u) \\
&\qquad \times f_{D_2 \mid Y_2 U}(1 \mid y_2, u) \cdot f_{Y_2 Y_1 U D_1}(y_2, y_1, u, 1)\,du.
\end{aligned}
\]
But this implies that the integral operator $L_{y_5,y_4,z,y_2}$ is written as the operator composition
\[ L_{y_5,y_4,z,y_2} = P_{y_5}\, Q_{y_5,y_4,z,y_2}\, R_{y_2}\, S_{y_2}. \]
Restriction 6 (i), (ii), (iii), and (iv) imply that the operators $P_{y_5}$, $Q_{y_5,y_4,z,y_2}$, $R_{y_2}$, and $S_{y_2}$ are invertible, respectively. Hence so is $L_{y_5,y_4,z,y_2}$. Using the two values $\{0, 1\}$ of $Z$, form the product
\[ L_{y_5,y_4,1,y_2}\, L_{y_5,y_4,0,y_2}^{-1} = P_{y_5}\, Q_{y_4,1/0,y_2}\, P_{y_5}^{-1} \]
where $Q_{y_4,1/0,y_2} = Q_{y_5,y_4,1,y_2}\, Q_{y_5,y_4,0,y_2}^{-1}$ is the multiplication operator with proxy odds defined by
\[
\begin{aligned}
(Q_{y_4,1/0,y_2}\,\xi)(u)
&= \frac{f_{Y_5 Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_5, y_4, 1, 1, 1, 1 \mid y_2, u)}{f_{Y_5 Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_5, y_4, 0, 1, 1, 1 \mid y_2, u)}\,\xi(u) \\
&= \frac{f_{Y_5 \mid Y_4 U}(y_5 \mid y_4, u) \cdot f_{Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_4, 1, 1, 1, 1 \mid y_2, u)}{f_{Y_5 \mid Y_4 U}(y_5 \mid y_4, u) \cdot f_{Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_4, 0, 1, 1, 1 \mid y_2, u)}\,\xi(u) \\
&= \frac{f_{Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_4, 1, 1, 1, 1 \mid y_2, u)}{f_{Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_4, 0, 1, 1, 1 \mid y_2, u)}\,\xi(u).
\end{aligned}
\]
Note the invariance of this operator in $y_5$, hence the notation. By Restriction 6 (ii), the operator $L_{y_5,y_4,1,y_2}\, L_{y_5,y_4,0,y_2}^{-1}$ is bounded. The expression $L_{y_5,y_4,1,y_2}\, L_{y_5,y_4,0,y_2}^{-1} = P_{y_5}\, Q_{y_4,1/0,y_2}\, P_{y_5}^{-1}$ thus allows unique eigenvalue-eigenfunction decomposition as in the proof of Lemma 2.

The distinct proxy odds as in Restriction 6 (ii) guarantee distinct eigenvalues and single dimensionality of the eigenspace associated with each eigenvalue. Within each single-dimensional eigenspace there is a unique eigenfunction pinned down by $L^1$-normalization, because densities integrate to one. The eigenvalues $\lambda(u)$ yield the multiplier of the operator $Q_{y_4,1/0,y_2}$, hence
\[ \lambda_{y_4,y_2}(u) = \frac{f_{Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_4, 1, 1, 1, 1 \mid y_2, u)}{f_{Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_4, 0, 1, 1, 1 \mid y_2, u)}. \]
This proxy odds in turn identifies the function $f_{Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_4, \cdot, 1, 1, 1 \mid y_2, u)$ since $Z$ is binary. The corresponding normalized eigenfunctions are the kernels of the integral operator $P_{y_5}$, hence $f_{Y_6 \mid Y_5 U}(\cdot \mid y_5, u)$. Lastly, Restriction 4 facilitates unique ordering of the eigenfunctions $f_{Y_6 \mid Y_5 U}(\cdot \mid y_5, u)$ by the distinct concrete values of $u = \lambda_{y_4,y_2}(u)$. This is feasible because the eigenvalues
\[ \lambda_{y_4,y_2}(u) = \frac{f_{Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_4, 1, 1, 1, 1 \mid y_2, u)}{f_{Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_4, 0, 1, 1, 1 \mid y_2, u)} \]
do not depend on $y_5$. That is, the eigenfunctions $f_{Y_6 \mid Y_5 U}(\cdot \mid y_5, u)$ of the operator $L_{y_5,y_4,1,y_2}\, L_{y_5,y_4,0,y_2}^{-1}$ across different $y_5$ can be uniquely ordered in $u$, invariantly in $y_5$, by the common set of ordered distinct eigenvalues $u = \lambda_{y_4,y_2}(u)$.

Therefore, $F_{Y_6 \mid Y_5 U}$ and $F_{Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_4, \cdot, 1, 1, 1 \mid y_2, u)$ are uniquely determined by the joint distribution $F_{Y_6 Y_5 Y_4 Z Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$, which in turn is uniquely determined by $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$. The multiplier of the operator $Q_{y_5,y_4,z,y_2}$ is of the form
\[
\begin{aligned}
f_{Y_5 Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_5, y_4, z, 1, 1, 1 \mid y_2, u)
&= f_{Y_5 \mid Y_4 U}(y_5 \mid y_4, u) \cdot f_{Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_4, z, 1, 1, 1 \mid y_2, u) \\
&= f_{Y_6 \mid Y_5 U}(y_5 \mid y_4, u) \cdot f_{Y_4 Z D_5 D_4 D_3 \mid Y_2 U}(y_4, z, 1, 1, 1 \mid y_2, u)
\end{aligned}
\]
by Lemma 6 (i), where the right-hand side object has been identified. Consequently, the operators $P_{y_5}$ and $Q_{y_5,y_4,z,y_2}$ are uniquely determined for each combination of $y_5, y_4, z, y_2$.

Step 2: Uniqueness of $F_{Y_2 Y_1 U D_1}(\cdot,\cdot,\cdot,1)$

By Lemma 5 (iii), Restriction 5 implies $f_{Y_2 \mid Y_1 U D_1}(y' \mid y, u, 1) = f_{Y_2 \mid Y_1 U}(y' \mid y, u)$.
Using this equality, write the density of the observed $F_{Y_2 Y_1 D_1}(\cdot,\cdot,1)$ as
\[
\begin{aligned}
f_{Y_2 Y_1 D_1}(y_2, y_1, 1) &= \int f_{Y_2 \mid Y_1 U D_1}(y_2 \mid y_1, u, 1)\, f_{Y_1 U D_1}(y_1, u, 1)\,du \\
&= \int f_{Y_2 \mid Y_1 U}(y_2 \mid y_1, u)\, f_{Y_1 U D_1}(y_1, u, 1)\,du.
\end{aligned}
\tag{22}
\]
By Lemma 6 (i), $F_{Y_6 \mid Y_5 U}(y' \mid y, u) = F_{Y_2 \mid Y_1 U}(y' \mid y, u)$ for all $y', y, u$. Therefore, we can write the operator $P_y$ as
\[ (P_{y_1}\,\xi)(y_2) = \int f_{Y_6 \mid Y_5 U}(y_2 \mid y_1, u) \cdot \xi(u)\,du = \int f_{Y_2 \mid Y_1 U}(y_2 \mid y_1, u) \cdot \xi(u)\,du. \]
With this operator notation, it follows from (22) that
\[ f_{Y_2 Y_1 D_1}(\cdot, y_1, 1) = P_{y_1}\, f_{Y_1 U D_1}(y_1, \cdot, 1). \]
By Restriction 6 (i) and (ii), this operator equation can be solved for $f_{Y_1 U D_1}(y_1, \cdot, 1)$ as
\[ f_{Y_1 U D_1}(y_1, \cdot, 1) = P_{y_1}^{-1}\, f_{Y_2 Y_1 D_1}(\cdot, y_1, 1). \tag{23} \]
Recall that $P_y$ was shown in Step 1 to be uniquely determined by the observed $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$. The function $f_{Y_2 Y_1 D_1}(\cdot, y, 1)$ is also uniquely determined by the observed joint distribution $F_{Y_2 Y_1 D_1}(\cdot,\cdot,1)$ up to null sets. Therefore, (22) shows that $f_{Y_1 U D_1}(\cdot,\cdot,1)$ is uniquely determined by the pair of the observed joint distributions $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$ and $F_{Y_2 Y_1 D_1}(\cdot,\cdot,1)$.

Using the solution to the above inverse problem, we can write the kernel of the operator $S_{y_2}$ as
\[
\begin{aligned}
f_{Y_2 Y_1 U D_1}(y_2, y_1, u, 1) &= f_{Y_2 \mid Y_1 U D_1}(y_2 \mid y_1, u, 1) \cdot f_{Y_1 U D_1}(y_1, u, 1) \\
&= f_{Y_2 \mid Y_1 U}(y_2 \mid y_1, u) \cdot f_{Y_1 U D_1}(y_1, u, 1) \\
&= f_{Y_6 \mid Y_5 U}(y_2 \mid y_1, u) \cdot f_{Y_1 U D_1}(y_1, u, 1) \\
&= f_{Y_6 \mid Y_5 U}(y_2 \mid y_1, u) \cdot [P_{y_1}^{-1}\, f_{Y_2 Y_1 D_1}(\cdot, y_1, 1)](u)
\end{aligned}
\]
where the second equality follows from Lemma 5 (iii), the third equality follows from Lemma 6 (i), and the fourth equality follows from (23).
Since $f_{Y_6 \mid Y_5 U}$ was shown in Step 1 to be uniquely determined by the observed joint distribution $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$, and $[P_{y_1}^{-1}\, f_{Y_2 Y_1 D_1}(\cdot, y_1, 1)]$ was shown in the previous paragraph to be uniquely determined for each $y_1$ by the observed joint distributions $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$ and $F_{Y_2 Y_1 D_1}(\cdot,\cdot,1)$, it follows that $f_{Y_2 Y_1 U D_1}(\cdot,\cdot,\cdot,1)$ too is uniquely determined by the observed joint distributions $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$ and $F_{Y_2 Y_1 D_1}(\cdot,\cdot,1)$. Equivalently, the operator $S_{y_2}$ is uniquely determined for each $y_2$.

Step 3: Uniqueness of $F_{Y_1 \mid Y_2 U D_2 D_1}(\cdot \mid \cdot,\cdot,1,1)$

This step is the same as Step 3 in the proof of Lemma 2, except that $L_{y,z}$ and $Q_{1/0}$ are replaced by $L_{y_5,y_4,z,y_2}$ and $Q_{y_4,1/0,y_2}$, respectively, which were defined in Step 1 of this proof. $F_{Y_1 \mid Y_2 U D_2 D_1}(\cdot \mid \cdot,\cdot,1,1)$, or equivalently the operator $T_y$, is uniquely determined by the observed joint distribution $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$.

Step 4: Uniqueness of $F_{Y_2 U D_2 D_1}(\cdot,\cdot,1,1)$

This step is the same as Step 4 in the proof of Lemma 2. $F_{Y_2 U D_2 D_1}(\cdot,\cdot,1,1)$, or equivalently the auxiliary operator $T'_y$, is uniquely determined by the pair of the observed joint distributions $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot,\cdot,1,1)$.

Step 5: Uniqueness of $F_{D_2 \mid Y_2 U}(1 \mid \cdot,\cdot)$

This step is the same as Step 5 in the proof of Lemma 2. $F_{D_2 \mid Y_2 U}(1 \mid \cdot,\cdot)$, or equivalently the multiplication operator $R_y$, is uniquely determined by the pair of the observed joint distributions $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot,\cdot,1,1)$.
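As a numerical illustration of the spectral argument in Step 1, the following Python sketch works with a finite-support analogue in which every operator is a matrix. The matrices `P`, `R`, `S`, the vector `odds`, and the function name `recover_from_proxy_odds` are hypothetical choices made for this illustration and are not taken from the proof; the sketch only checks that the eigen-decomposition of the product of the operator at $Z = 1$ with the inverse of the operator at $Z = 0$ returns the proxy odds as eigenvalues and the columns of $P$ as sum-normalized eigenvectors.

```python
import numpy as np


def recover_from_proxy_odds(L1, L0):
    """Eigen-decompose L1 @ inv(L0) = P @ diag(odds) @ inv(P).

    Returns the eigenvalues (proxy odds) in increasing order and the
    eigenvectors rescaled so that each column sums to one.
    """
    A = L1 @ np.linalg.inv(L0)
    eigvals, eigvecs = np.linalg.eig(A)
    eigvals, eigvecs = np.real(eigvals), np.real(eigvecs)
    order = np.argsort(eigvals)            # order latent types by proxy odds
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Densities sum to one, which fixes both the scale and the sign.
    P_hat = eigvecs / eigvecs.sum(axis=0, keepdims=True)
    return eigvals, P_hat


# Hypothetical finite-support example: three latent types, three outcome values.
P = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])       # column u: density of the next outcome for type u
odds = np.array([0.5, 1.0, 2.0])       # distinct proxy odds, increasing in u
q1 = odds / (1.0 + odds)               # Pr(Z = 1 | u)
q0 = 1.0 - q1                          # Pr(Z = 0 | u)
R = np.diag([0.9, 0.8, 0.7])           # survival probabilities by type
S = np.array([[0.10, 0.05, 0.02],
              [0.04, 0.12, 0.05],
              [0.02, 0.06, 0.11]])     # row u: joint density over the initial outcome

L1 = P @ np.diag(q1) @ R @ S           # observable operator at Z = 1
L0 = P @ np.diag(q0) @ R @ S           # observable operator at Z = 0

odds_hat, P_hat = recover_from_proxy_odds(L1, L0)
```

In this toy setting the diagonal factors commute, so `R` and `S` cancel in the product and only the similarity transform of the odds by `P` remains, mirroring the cancellation used in Step 1. Sorting by eigenvalue plays the role of the ordering of $u$.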
Step 6: Uniqueness of $F_{Y_1 U}$

Recall from Step 2 that $f_{Y_2 Y_1 U D_1}(\cdot,\cdot,\cdot,1)$ is uniquely determined by the observed joint distributions $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$ and $F_{Y_2 Y_1 D_1}(\cdot,\cdot,1)$. We can write
\[
\begin{aligned}
f_{Y_2 Y_1 U D_1}(y_2, y_1, u, 1) &= f_{Y_2 \mid Y_1 U D_1}(y_2 \mid y_1, u, 1)\, f_{D_1 \mid Y_1 U}(1 \mid y_1, u)\, f_{Y_1 U}(y_1, u) \\
&= f_{Y_2 \mid Y_1 U}(y_2 \mid y_1, u)\, f_{D_1 \mid Y_1 U}(1 \mid y_1, u)\, f_{Y_1 U}(y_1, u) \\
&= f_{Y_6 \mid Y_5 U}(y_2 \mid y_1, u)\, f_{D_2 \mid Y_2 U}(1 \mid y_1, u)\, f_{Y_1 U}(y_1, u),
\end{aligned}
\]
where the second equality follows from Lemma 5 (iii), and the third equality follows from Lemma 6 (i) and (ii). For a given $(y_1, u)$, there must exist some $y_2$ such that $f_{Y_6 \mid Y_5 U}(y_2 \mid y_1, u) > 0$ by a property of conditional density functions. Moreover, Restriction 6 (iii) requires that $f_{D_2 \mid Y_2 U}(1 \mid y_1, u) > 0$ for a given $y_1$ for all $u$. Therefore, for such a choice of $y_2$, we can write
\[ f_{Y_1 U}(y_1, u) = \frac{f_{Y_2 Y_1 U D_1}(y_2, y_1, u, 1)}{f_{Y_6 \mid Y_5 U}(y_2 \mid y_1, u)\, f_{D_2 \mid Y_2 U}(1 \mid y_1, u)}. \]
Recall that $f_{Y_6 \mid Y_5 U}(\cdot \mid \cdot,\cdot)$ was shown in Step 1 to be uniquely determined by the observed joint distribution $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$, $f_{Y_2 Y_1 U D_1}(\cdot,\cdot,\cdot,1)$ was shown in Step 2 to be uniquely determined by the pair of the observed joint distributions $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$ and $F_{Y_2 Y_1 D_1}(\cdot,\cdot,1)$, and $f_{D_2 \mid Y_2 U}(1 \mid \cdot,\cdot)$ was shown in Step 5 to be uniquely determined by the observed joint distributions $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$ and $F_{Y_2 Y_1 D_1}(\cdot,\cdot,1)$. Therefore, it follows that the initial joint density $f_{Y_1 U}$ is uniquely determined by the observed $F_{Y_6 Y_5 Y_4 Y_3 Y_2 Y_1 D_5 D_4 D_3 D_2 D_1}(\cdot,\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1,1,1)$ and $F_{Y_2 Y_1 D_1}(\cdot,\cdot,1)$. □

We next discuss an identification-preserving criterion analogous to Corollary 1.
Let $\mathcal{F}$ denote the set of all the admissible model representations
\[ \mathcal{F} = \{(F_{Y_t \mid Y_{t-1} U}, F_{D_t \mid Y_t U}, F_{Y_1 U}, F_{Z \mid U}) \mid (g, h, F_{Y_1 U}, \zeta) \text{ satisfies Restrictions 1, 4, 5, and 6}\}. \]
A natural consequence of the main identification result of Lemma 7 is that the true model $(F^*_{Y_t \mid Y_{t-1} U}, F^*_{D_t \mid Y_t U}, F^*_{Y_1 U}, F^*_{Z \mid U})$ is the unique maximizer of the following criterion.

Corollary 2 (Constrained Maximum Likelihood). If the quadruple for the true model $(F^*_{Y_t \mid Y_{t-1} U}, F^*_{D_t \mid Y_t U}, F^*_{Y_1 U}, F^*_{Z \mid U})$ is an element of $\mathcal{F}$, then it is the unique solution to
\[
\begin{aligned}
\max_{(F_{Y_t \mid Y_{t-1} U},\, F_{D_t \mid Y_t U},\, F_{Y_1 U},\, F_{Z \mid U}) \in \mathcal{F}} \;
& c_1\, E\left[ \log \int f_{Y_t \mid Y_{t-1} U}(Y_2 \mid Y_1, u)\, f_{D_t \mid Y_t U}(1 \mid Y_1, u)\, f_{Y_1 U}(Y_1, u)\, d\mu(u) \,\middle|\, D_1 = 1 \right] \\
+\; & c_2\, E\left[ \log \int \prod_{s=1}^{5} \left[ f_{Y_t \mid Y_{t-1} U}(Y_{s+1} \mid Y_s, u)\, f_{D_t \mid Y_t U}(1 \mid Y_s, u) \right] f_{Y_1 U}(Y_1, u)\, d\mu(u) \,\middle|\, D_5 = \cdots = D_1 = 1 \right]
\end{aligned}
\]
for any $c_1, c_2 > 0$ subject to
\[ \int f_{D_t \mid Y_t U}(1 \mid y_1, u)\, f_{Y_1 U}(y_1, u)\, d\mu(y_1, u) = f_{D_1}(1) \]
and
\[ \int \prod_{s=2}^{5} \left[ f_{Y_t \mid Y_{t-1} U}(y_s \mid y_{s-1}, u)\, f_{D_t \mid Y_t U}(1 \mid y_s, u) \right] f_{D_t \mid Y_t U}(1 \mid y_1, u)\, f_{Y_1 U}(y_1, u)\, d\mu(y_2, y_1, u) = f_{D_5 D_4 D_3 D_2 D_1}(1, 1, 1, 1, 1). \]

10.3. Models with Higher-Order Lags. The model discussed in this paper can be extended to the following model
\[
\begin{cases}
Y_t = g(Y_{t-1}, \cdots, Y_{t-\tau}, U, E_t) & \text{for } t = \tau + 1, \cdots, T \\
D_t = h(Y_t, \cdots, Y_{t-\tau+1}, U, V_t) & \text{for } t = \tau, \cdots, T - 1 \\
F_{Y_\tau \cdots Y_1 U D_{\tau-1} \cdots D_1}(\cdots, \cdot, (1)) \\
Z = \zeta(U, W)
\end{cases}
\]
where $g$ is a $\tau$-th order Markov process with heterogeneity $U$, and the attrition model depends on the past as well as the current state. In this setup, we can observe the parts, $F_{Y_{\tau+2} \cdots Y_1 Z D_{\tau+1} \cdots D_1}(\cdots, \cdot, (1))$ and $f_{Y_{\tau+1} \cdots Y_1 Z D_\tau \cdots D_1}(\cdots, \cdot, (1))$, of the joint distributions if $T = \tau + 2$. I claim that $T = \tau + 2$ suffices for identification. In other words, it can be shown that $(g, h, F_{Y_\tau \cdots Y_1 U D_{\tau-1} \cdots D_1}(\cdots, \cdot, (1)), \zeta)$ is uniquely determined by $F_{Y_{\tau+2} \cdots Y_1 Z D_{\tau+1} \cdots D_1}(\cdots, \cdot, (1))$ and $f_{Y_{\tau+1} \cdots Y_1 Z D_\tau \cdots D_1}(\cdots, \cdot, (1))$ up to equivalence classes.
To this end, we replace Restrictions 2 and 3 by the following restrictions. Restriction 7 (Independence). (i) Exogeneity of Et : Et ⊥⊥ (U, {Ys }τs=1 , {Ds }τs=1 , {Es }s 0 for all u ∈ U. (iv) Initial Heterogeneity: the integral operator S(y) : L2 (FYt ) → L2 (FU ) defined by ∫ S(y) ξ(u) = fYτ +1 ···Y2 Y1 U Dτ ···D1 ((y), y ′ , u, (1)) · ξ(y ′ )dy ′ is bounded and invertible. Lemma 8 (Independence). The following implications hold: (i) Restriction 7 (i) ⇒ Yτ +2 ⊥⊥ (Y1 , {Dt }τt=1 +1 , Z) | ({Yt }τt=2 +1 , U ). (ii) Restriction 7 (i) ⇒ Yτ +1 ⊥⊥ ({Dt }τt=1 , Z) | ({Yt }τt=1 , U ). 79 (iii) Restriction 7 (ii) ⇒ Dτ +1 ⊥⊥ (Y1 , {Dt }τt=1 ) | ({Yt }τt=2 +1 , U ). (iv) Restriction 7 (iii) ⇒ Z ⊥⊥ ({Yt }τt=1 +1 , {Dt }τt=1 +1 ) | U. Proof. As in the proof of Lemma 3, we use the following two properties of conditional independence: CI.1. A ⊥⊥ B implies A ⊥⊥ B | ϕ(B) for any Borel function ϕ. CI.2. A ⊥⊥ B | C implies A ⊥⊥ ϕ(B, C) | C for any Borel function ϕ. (i) First, note that Restriction 2 (i) Eτ +2 ⊥⊥ (U, {Yt }τt=1 , {Dt }τt=1 , Eτ +1 , Vτ +1 , W ) to- gether with the structural definition Z = ζ(U, W ) implies the independence restriction Eτ +2 ⊥⊥ (U, {Yt }τt=1 , {Dt }τt=1 , Eτ +1 , Vτ +1 , Vτ , Z). Applying CI.1 to this independence rela- tion Eτ +2 ⊥⊥ (U, {Yt }τt=1 , {Dt }τt=1 , Eτ +1 , Vτ +1 , Z) yields Eτ +2 ⊥⊥ (U, {Yt }τt=1 , {Dt }τt=1 , Eτ +1 , Vτ +1 , Z) | (g(Yτ , · · · , Y1 , U, Eτ +1 ), {Yt }τt=2 , U ). Since Yτ +1 = g(Yτ , · · · , Y1 , U, Eτ +1 ), this conditional independence relation can be rewritten as Eτ +2 ⊥⊥ (U, {Yt }τt=1 , {Dt }τt=1 , Eτ +1 , Vτ +1 , Z) | ({Yt }τt=2 +1 , U ). Next, applying CI.2 to this conditional independence yields Eτ +2 ⊥⊥ (Y1 , h(Yτ +1 , · · · , Y2 , U, Vτ +1 ), {Dt }τt=1 , Z) | ({Yt }τt=2 +1 , U ). Since Dτ +1 = h(Yτ +1 , · · · , Y2 , U, Vτ +1 )), this conditional independence can be rewritten as Eτ +2 ⊥⊥ (Y1 , {Dt }τt=1 +1 , Z) | ({Yt }τt=2 +1 , U ). 
Lastly, applying CI.2 again to this conditional independence yields $g(Y_{\tau+1}, \cdots, Y_2, U, E_{\tau+2}) \perp\!\!\!\perp (Y_1, \{D_t\}_{t=1}^{\tau+1}, Z) \mid (\{Y_t\}_{t=2}^{\tau+1}, U)$. Since $Y_{\tau+2} = g(Y_{\tau+1}, \cdots, Y_2, U, E_{\tau+2})$, this conditional independence relation can be rewritten as $Y_{\tau+2} \perp\!\!\!\perp (Y_1, \{D_t\}_{t=1}^{\tau+1}, Z) \mid (\{Y_t\}_{t=2}^{\tau+1}, U)$.

(ii) Note that Restriction 7 (i) $E_{\tau+1} \perp\!\!\!\perp (U, \{Y_t\}_{t=1}^{\tau}, \{D_t\}_{t=1}^{\tau}, W)$ together with the structural definition $Z = \zeta(U, W)$ implies $E_{\tau+1} \perp\!\!\!\perp (U, \{Y_t\}_{t=1}^{\tau}, \{D_t\}_{t=1}^{\tau}, Z)$. Applying CI.1 to this independence relation yields
\[ E_{\tau+1} \perp\!\!\!\perp (U, \{Y_t\}_{t=1}^{\tau}, \{D_t\}_{t=1}^{\tau}, Z) \mid (\{Y_t\}_{t=1}^{\tau}, U). \]
Next, applying CI.2 to this conditional independence yields
\[ g(Y_\tau, \cdots, Y_1, U, E_{\tau+1}) \perp\!\!\!\perp (U, \{Y_t\}_{t=1}^{\tau}, \{D_t\}_{t=1}^{\tau}, Z) \mid (\{Y_t\}_{t=1}^{\tau}, U). \]
Since $Y_{\tau+1} = g(Y_\tau, \cdots, Y_1, U, E_{\tau+1})$, this conditional independence relation can be rewritten as $Y_{\tau+1} \perp\!\!\!\perp (U, \{Y_t\}_{t=1}^{\tau}, \{D_t\}_{t=1}^{\tau}, Z) \mid (\{Y_t\}_{t=1}^{\tau}, U)$. Lastly, applying CI.2 again to this conditional independence yields $Y_{\tau+1} \perp\!\!\!\perp (\{D_t\}_{t=1}^{\tau}, Z) \mid (\{Y_t\}_{t=1}^{\tau}, U)$.

(iii) Applying CI.1 to Restriction 7 (ii) $V_{\tau+1} \perp\!\!\!\perp (U, \{Y_t\}_{t=1}^{\tau}, \{D_t\}_{t=1}^{\tau-1}, E_{\tau+1}, V_\tau)$ yields
\[ V_{\tau+1} \perp\!\!\!\perp (U, \{Y_t\}_{t=1}^{\tau}, \{D_t\}_{t=1}^{\tau-1}, E_{\tau+1}, V_\tau) \mid (g(Y_\tau, \cdots, Y_1, U, E_{\tau+1}), \{Y_t\}_{t=2}^{\tau}, U). \]
Since $Y_{\tau+1} = g(Y_\tau, \cdots, Y_1, U, E_{\tau+1})$ by construction, it can be rewritten as $V_{\tau+1} \perp\!\!\!\perp (U, \{Y_t\}_{t=1}^{\tau}, \{D_t\}_{t=1}^{\tau-1}, E_{\tau+1}, V_\tau) \mid (\{Y_t\}_{t=2}^{\tau+1}, U)$. Next, applying CI.2 to this conditional independence yields
\[ V_{\tau+1} \perp\!\!\!\perp (Y_1, h(Y_\tau, \cdots, Y_1, U, V_\tau), \{D_t\}_{t=1}^{\tau-1}) \mid (\{Y_t\}_{t=2}^{\tau+1}, U). \]
Since $D_\tau = h(Y_\tau, \cdots, Y_1, U, V_\tau)$, it can be rewritten as $V_{\tau+1} \perp\!\!\!\perp (Y_1, \{D_t\}_{t=1}^{\tau}) \mid (\{Y_t\}_{t=2}^{\tau+1}, U)$. Lastly, applying CI.2 to this conditional independence yields
\[ h(Y_{\tau+1}, \cdots, Y_2, U, V_{\tau+1}) \perp\!\!\!\perp (Y_1, \{D_t\}_{t=1}^{\tau}) \mid (\{Y_t\}_{t=2}^{\tau+1}, U). \]
Since $D_{\tau+1} = h(Y_{\tau+1}, \cdots, Y_2, U, V_{\tau+1})$, it can be rewritten as $D_{\tau+1} \perp\!\!\!\perp (Y_1, \{D_t\}_{t=1}^{\tau}) \mid (\{Y_t\}_{t=2}^{\tau+1}, U)$.
(iv) Note that Restriction 7 (iii) $W \perp\!\!\!\perp (\{Y_t\}_{t=1}^{\tau}, \{D_t\}_{t=1}^{\tau}, E_{\tau+1}, V_{\tau+1})$ together with the structural definition $Z = \zeta(U, W)$ yields $Z \perp\!\!\!\perp (\{Y_t\}_{t=1}^{\tau}, \{D_t\}_{t=1}^{\tau}, E_{\tau+1}, V_{\tau+1}) \mid U$. Applying CI.2 to this conditional independence relation yields
\[ Z \perp\!\!\!\perp \big(g(Y_\tau, \cdots, Y_1, U, E_{\tau+1}), \{Y_t\}_{t=1}^{\tau}, h(g(Y_\tau, \cdots, Y_1, U, E_{\tau+1}), Y_\tau, \cdots, Y_2, U, V_{\tau+1}), \{D_t\}_{t=1}^{\tau}\big) \mid U. \]
Since $Y_{\tau+1} = g(Y_\tau, \cdots, Y_1, U, E_{\tau+1})$ and $D_{\tau+1} = h(Y_{\tau+1}, \cdots, Y_2, U, V_{\tau+1})$, this conditional independence can be rewritten as $Z \perp\!\!\!\perp (\{Y_t\}_{t=1}^{\tau+1}, \{D_t\}_{t=1}^{\tau+1}) \mid U$. □

Lemma 9 (Invariant Transition). (i) Under Restrictions 1 and 7 (i), $F_{Y_{\tau+2} \mid Y_{\tau+1} \cdots Y_2 U}(y' \mid (y), u) = F_{Y_{\tau+1} \mid Y_\tau \cdots Y_1 U}(y' \mid (y), u)$ for all $y', (y), u$. (ii) Under Restrictions 1 and 7 (ii), $F_{D_{\tau+1} \mid Y_{\tau+1} \cdots Y_2 U}(d \mid (y), u) = F_{D_\tau \mid Y_\tau \cdots Y_1 U}(d \mid (y), u)$ for all $d, (y), u$.

Proof. (i) First, note that Restriction 7 (i) implies $E_{\tau+2} \perp\!\!\!\perp (U, Y_\tau, \cdots, Y_1, E_{\tau+1})$, which in turn implies that $E_{\tau+2} \perp\!\!\!\perp (g(Y_\tau, \cdots, Y_1, U, E_{\tau+1}), Y_\tau, \cdots, Y_2, U)$, hence $E_{\tau+2} \perp\!\!\!\perp (Y_{\tau+1}, \cdots, Y_2, U)$. Second, Restriction 7 (i) in particular yields $E_{\tau+1} \perp\!\!\!\perp (Y_\tau, \cdots, Y_1, U)$. Using these two independence results, we obtain
\[
\begin{aligned}
F_{Y_{\tau+2} \mid Y_{\tau+1} \cdots Y_2 U}(y' \mid (y), u)
&= \Pr[g((y), u, E_{\tau+2}) \le y' \mid (Y_{\tau+1}, \cdots, Y_2) = (y), U = u] \\
&= \Pr[g((y), u, E_{\tau+2}) \le y'] \\
&= \Pr[g((y), u, E_{\tau+1}) \le y'] \\
&= \Pr[g((y), u, E_{\tau+1}) \le y' \mid (Y_\tau, \cdots, Y_1) = (y), U = u] \\
&= F_{Y_{\tau+1} \mid Y_\tau \cdots Y_1 U}(y' \mid (y), u)
\end{aligned}
\]
for all $y', (y), u$, where the second equality follows from $E_{\tau+2} \perp\!\!\!\perp (Y_{\tau+1}, \cdots, Y_2, U)$, the third equality follows from identical distribution of $E_t$ by Restriction 1, and the fourth equality follows from $E_{\tau+1} \perp\!\!\!\perp (Y_\tau, \cdots, Y_1, U)$.

(ii) Restriction 7 (ii) implies that $V_{\tau+1} \perp\!\!\!\perp (g(Y_\tau, \cdots, Y_1, U, E_{\tau+1}), Y_\tau, \cdots, Y_1, U)$, hence $V_{\tau+1} \perp\!\!\!\perp (Y_{\tau+1}, \cdots, Y_2, U)$. Restriction 7 (ii) also implies $V_\tau \perp\!\!\!\perp (Y_\tau, \cdots, Y_1, U)$.
Using these two independence results, we obtain
\[
\begin{aligned}
F_{D_{\tau+1} \mid Y_{\tau+1} \cdots Y_2 U}(d \mid (y), u)
&= \Pr[h((y), u, V_{\tau+1}) \le d \mid (Y_{\tau+1}, \cdots, Y_2) = (y), U = u] \\
&= \Pr[h((y), u, V_{\tau+1}) \le d] \\
&= \Pr[h((y), u, V_\tau) \le d] \\
&= \Pr[h((y), u, V_\tau) \le d \mid (Y_\tau, \cdots, Y_1) = (y), U = u] \\
&= F_{D_\tau \mid Y_\tau \cdots Y_1 U}(d \mid (y), u)
\end{aligned}
\]
for all $d, (y), u$, where the second equality follows from $V_{\tau+1} \perp\!\!\!\perp (Y_{\tau+1}, \cdots, Y_2, U)$, the third equality follows from identical distribution of $V_t$ from Restriction 1, and the fourth equality follows from $V_\tau \perp\!\!\!\perp (Y_\tau, \cdots, Y_1, U)$. □

Lemma 10 (Identification). Under Restrictions 1, 4, 7, and 8, the quadruple $(F_{Y_{\tau+2} \mid Y_{\tau+1} \cdots Y_2 U}, F_{D_{\tau+1} \mid Y_{\tau+1} \cdots Y_2 U}, F_{Y_\tau \cdots Y_1 U D_{\tau-1} \cdots D_1}(\cdots, \cdot, (1)), F_{Z \mid U})$ is uniquely determined by $F_{Y_{\tau+2} \cdots Y_1 Z D_{\tau+1} \cdots D_1}(\cdots, \cdot, (1))$ and $F_{Y_{\tau+1} \cdots Y_1 Z D_\tau \cdots D_1}(\cdots, \cdot, (1))$.

Proof. Given fixed $(y)$ and $z$, define the operators $L_{(y),z} : L^2(F_{Y_t}) \to L^2(F_{Y_t})$, $P_{(y)} : L^2(F_U) \to L^2(F_{Y_t})$, $Q_z : L^2(F_U) \to L^2(F_U)$, $R_{(y)} : L^2(F_U) \to L^2(F_U)$, $S_{(y)} : L^2(F_{Y_t}) \to L^2(F_U)$, $T_{(y)} : L^2(F_{Y_t}) \to L^2(F_U)$, and $T'_{(y)} : L^2(F_U) \to L^2(F_U)$ by
\[
\begin{aligned}
(L_{(y),z}\,\xi)(y_{\tau+2}) &= \int f_{Y_{\tau+2} \cdots Y_1 Z D_{\tau+1} \cdots D_1}(y_{\tau+2}, (y), y_1, z, (1)) \cdot \xi(y_1)\,dy_1, \\
(P_{(y)}\,\xi)(y_{\tau+2}) &= \int f_{Y_{\tau+2} \mid Y_{\tau+1} \cdots Y_2 U}(y_{\tau+2} \mid (y), u) \cdot \xi(u)\,du, \\
(Q_z\,\xi)(u) &= f_{Z \mid U}(z \mid u) \cdot \xi(u), \\
(R_{(y)}\,\xi)(u) &= f_{D_{\tau+1} \mid Y_{\tau+1} \cdots Y_2 U}(1 \mid (y), u) \cdot \xi(u), \\
(S_{(y)}\,\xi)(u) &= \int f_{Y_{\tau+1} \cdots Y_2 Y_1 U D_\tau \cdots D_1}((y), y_1, u, (1)) \cdot \xi(y_1)\,dy_1, \\
(T_{(y)}\,\xi)(u) &= \int f_{Y_1 \mid Y_{\tau+1} \cdots Y_2 U D_{\tau+1} \cdots D_1}(y_1 \mid (y), u, (1)) \cdot \xi(y_1)\,dy_1, \\
(T'_{(y)}\,\xi)(u) &= f_{Y_{\tau+1} \cdots Y_2 U D_{\tau+1} \cdots D_1}((y), u, (1)) \cdot \xi(u),
\end{aligned}
\]
respectively. The operators $L_{(y),z}$, $P_{(y)}$, $S_{(y)}$, and $T_{(y)}$ are integral operators whereas $Q_z$, $R_{(y)}$, and $T'_{(y)}$ are multiplication operators. Note that $L_{(y),z}$ is identified from the observed joint distribution $F_{Y_{\tau+2} \cdots Y_1 Z D_{\tau+1} \cdots D_1}(\cdots, \cdot, (1))$.
Step 1: Uniqueness of FYτ +2 |Yτ +1 ···Y2 U and FZ|U The kernel fYτ +2 ···Y1 ZDτ +1 ···D1 ( · , (y), · , z, (1)) of the integral operator L(y),z can be rewritten as ∫ fYτ +2 ···Y1 ZDτ +1 ···D1 (yτ +2 , (y), y1 , z, (1)) = fYτ +2 |Yτ +1 ···Y1 ZU Dτ +1 ···D1 (yτ +2 | (y), y1 , z, u, (1)) ×fZ|Yτ +1 ···Y1 U Dτ +1 ···D1 (z | (y), y1 , u, (1)) ×fDτ +1 |Yτ +1 ···Y1 U Dτ ···D1 (1 | (y), y1 , u, (1)) (24) ×fYτ +1 ···Y1 U Dτ ···D1 ((y), y1 , u, (1)) du But by Lemma 8 (i), (iv), and (iii), respectively, Restriction 7 implies that fYτ +2 |Yτ +1 ···Y1 ZU Dτ +1 ···D1 (yτ +2 | (y), y1 , z, u, (1)) = fYτ +2 |Yτ +1 ···Y2 U (yτ +2 | (y), u), fZ|Yτ +1 ···Y1 U Dτ +1 ···D1 (z | (y), y1 , u, (1)) = fZ|U (z | u), fDτ +1 |Yτ +1 ···Y1 U Dτ ···D1 (1 | (y), y1 , u, (1)) = fDτ +1 |Yτ +1 ···Y2 U (1 | (y), u). 83 Equation (24) thus can be rewritten as ∫ fYτ +2 ···Y1 ZDτ +1 ···D1 (yτ +2 , (y), y1 , z, (1)) = fYτ +2 |Yτ +1 ···Y2 U (yτ +2 | (y), u) · fZ|U (z | u) ×fDτ +1 |Yτ +1 ···Y2 U (1 | (y), u) ×fYτ +1 ···Y1 U Dτ ···D1 ((y), y1 , u, (1)) du But this implies that the integral operator Ly,z is written as the operator composition L(y),z = P(y) Qz R(y) S(y) . Restriction 8 (i), (ii), (iii), and (iv) imply that the operators P(y) , Qz , R(y) , and S(y) are invertible, respectively. Hence so is L(y),z . Using the two values {0, 1} of Z, form the product L(y),1 L−1 −1 y,0 = P(y) Q1/0 P(y) where Qz/z′ := Qz Q−1 −1 z ′ . By Restriction 8 (ii), the operator L(y),1 L(y),0 is bounded. The expression L(y),1 L−1 −1 (y),0 = P(y) Q1/0 P(y) thus allows unique eigenvalue-eigenfunction decomposition. The distinct proxy odds as in Restriction 8 (ii) guarantee distinct eigenvalues and single dimensionality of the eigenspace associated with each eigenvalue. Within each of the single-dimensional eigenspace is a unique eigenfunction pinned down by L1 -normalization because of the unity of integrated densities. The eigenvalues λ(u) yield the multiplier of the operator Q1/0 , hence λ(u) = fZ|U (1 | u)/fZ|U (0 | u). 
This proxy odds in turn identifies fZ|U ( · | u) since Z is binary. The correspond- ing normalized eigenfunctions are the kernels of the integral operator P(y) , hence fYτ +2 |Yτ +1 ···Y2 U ( · | (y), u). Lastly, Restriction 4 facilitates unique ordering of the eigenfunctions fYτ +2 |Yτ +1 ···Y2 U ( · | (y), u) by the distinct concrete values of u = λ(u). This is feasible because the eigenvalues λ(u) = fZ|U (1 | u)/fZ|U (0 | u) are invariant from (y). That is, eigenfunctions fYτ +2 |Yτ +1 ···Y2 U ( · | (y), u) of the operator L(y),1 L−1 (y),0 across different (y) can be uniquely ordered in u invariantly from (y) by the common set of ordered distinct eigenvalues u = λ(u). 84 Therefore, FYτ +2 |Yτ +1 ···Y2 U and FZ|U are uniquely determined by the observed joint distribution FYτ +2 ···Y1 ZDτ +1 ···D1 (· · · , · , (1)). Equivalently, the operators P(y) and Qz are uniquely determined for each (y) and z, respectively. Step 2: Uniqueness of FYτ +1 ···Y1 U Dτ ···D1 (· · · , · , (1)) By Lemma 8 (ii), Restriction 7 implies fYτ +1 |Yτ ···Y1 U Dτ ···D1 (y ′ | (y), u, (1)) = fYτ +1 |Yτ ···Y1 U (y ′ | (y), u). Using this equality, write the density of the observed joint distribution FYτ +1 ···Y1 Dτ ···D1 (· · · , (1)) as ∫ ′ fYτ +1 ···Y1 Dτ ···D1 (y , (y), (1)) = fYτ +1 |Yτ ···Y1 U Dτ ···D1 (y ′ | (y), u, (1)) ×fYτ ···Y1 U Dτ ···D1 ((y), u, (1))du ∫ = fYτ +1 |Yτ ···Y1 U (y ′ | (y), u) (25) ×fYτ ···Y1 U Dτ ···D1 ((y), u, (1))du By Lemma 9 (i), FYτ +2 |Yτ +1 ···Y2 U (y ′ | (y), u) = FYτ +1 |Yτ ···Y1 U (y ′ | (y), u) for all y ′ , (y), u. Therefore, we can write the operator P(y) as ∫ ∫ (P(y) ξ)(y ′ ) = fYτ +2 |Yτ +1 ···Y2 U (y ′ | (y), u) · ξ(u)du = fYτ +1 |Yτ ···Y1 U (y ′ | (y), u) · ξ(u)du. With this operator notation, it follows from (25) that fYτ +1 ···Y1 Dτ ···D1 ( · , (y), (1)) = P(y) fYτ ···Y1 U Dτ ···D1 ((y), · , (1)). 
By Restriction 8 (i) and (ii), this operator equation can be solved for the function fYτ ···Y1 U Dτ ···D1 ((y), · , (1)) as −1 (26) fYτ ···Y1 U Dτ ···D1 ((y), · , (1)) = P(y) fYτ +1 ···Y1 Dτ ···D1 ( · , (y), (1)) Recall that P(y) was shown in Step 1 to be uniquely determined by the observed joint distribution FYτ +2 ···Y1 ZDτ +1 ···D1 (· · · , · , (1)). The function fYτ +1 ···Y1 Dτ ···D1 ( · , (y), (1)) is also uniquely determined by the observed joint distribution fYτ +1 ···Y1 Dτ ···D1 (· · · , (1)). Therefore, (25) shows that fYτ ···Y1 U Dτ ···D1 (· · · , · , (1)) is uniquely determined by the observed joint distributions FYτ +2 ···Y1 ZDτ +1 ···D1 (· · · , · , (1)) and fYτ +1 ···Y1 Dτ ···D1 (· · · , (1)). 85 Using the solution to the above inverse problem, we can write the kernel of the operator S(y) as fYτ +1 ···Y1 U Dτ ···D1 (y ′ , (y), u, (1)) = fYτ +1 |Yτ ···Y1 U Dτ ···D1 (y ′ | (y), u, (1)) · fYτ ···Y1 U Dτ ···D1 ((y), u, (1)) = fYτ +1 |Yτ ···Y1 U (y ′ | (y), u) · fYτ ···Y1 U Dτ ···D1 ((y), u, (1)) = fYτ +2 |Yτ +1 ···Y2 U (y ′ | (y), u) · fYτ ···Y1 U Dτ ···D1 ((y), u, (1)) = fYτ +2 |Yτ +1 ···Y2 U (y ′ | (y), u) −1 ×[P(y) fYτ +1 ···Y1 ZDτ ···D1 ( · , (y), z, (1))](u) where the second equality follows from Lemma 8 (ii), the third equality follows from Lemma 9 (i), and the forth equality follows from (26). Since fYτ +2 |Yτ +1 ···Y2 U was shown in Step 1 to be uniquely determined by the observed joint distribution −1 FYτ +2 ···Y1 ZDτ +1 ···D1 (· · · , · , (1)) and [P(y) fYτ +1 ···Y1 ZDτ ···D1 ( · , (y), z, (1))] was shown in the previous paragraph to be uniquely determined for each y by the observed joint dis- tributions FYτ +2 ···Y1 ZDτ +1 ···D1 (· · · , · , (1)) and fYτ +1 ···Y1 Dτ ···D1 (· · · , (1)), it follows that fYτ +1 ···Y1 U Dτ ···D1 (· · · , · , (1)) too is uniquely determined by the observed joint distri- butions FYτ +2 ···Y1 ZDτ +1 ···D1 (· · · , · , (1)) and fYτ +1 ···Y1 Dτ ···D1 (· · · , (1)). 
Equivalently, the operator $S_{(y)}$ is uniquely determined for each $(y)$.

Step 3: Uniqueness of $F_{Y_1 \mid Y_{\tau+1} \cdots Y_2 U D_{\tau+1} \cdots D_1}(\cdot \mid \cdots, \cdot, (1))$

First, note that the kernel of the composite operator $T'_{(y)} T_{(y)}$ can be written as
\[
\begin{aligned}
& f_{Y_{\tau+1} \cdots Y_2 U D_{\tau+1} \cdots D_1}((y), u, (1)) \cdot f_{Y_1 \mid Y_{\tau+1} \cdots Y_2 U D_{\tau+1} \cdots D_1}(y_1 \mid (y), u, (1)) \\
&\quad = f_{Y_{\tau+1} \cdots Y_1 U D_{\tau+1} \cdots D_1}((y), y_1, u, (1)) \\
&\quad = f_{D_{\tau+1} \mid Y_{\tau+1} \cdots Y_1 U D_\tau \cdots D_1}(1 \mid (y), y_1, u, (1)) \cdot f_{Y_{\tau+1} \cdots Y_1 U D_\tau \cdots D_1}((y), y_1, u, (1)) \\
&\quad = f_{D_{\tau+1} \mid Y_{\tau+1} \cdots Y_2 U}(1 \mid (y), u) \cdot f_{Y_{\tau+1} \cdots Y_1 U D_\tau \cdots D_1}((y), y_1, u, (1))
\end{aligned}
\tag{27}
\]
where the last equality is due to Lemma 8 (iii). But the last expression corresponds to the kernel of the composite operator $R_{(y)} S_{(y)}$, thus showing that $T'_{(y)} T_{(y)} = R_{(y)} S_{(y)}$. But then, $L_{(y),z} = P_{(y)} Q_z R_{(y)} S_{(y)} = P_{(y)} Q_z T'_{(y)} T_{(y)}$. Note that the invertibility of $R_{(y)}$ and $S_{(y)}$ as required by Restriction 8 implies invertibility of $T_{(y)}$ and $T'_{(y)}$ as well, for otherwise the equivalent composite operator $T'_{(y)} T_{(y)} = R_{(y)} S_{(y)}$ would have a nontrivial nullspace.

Using Restriction 8, form the product of operators as
\[ L_{(y),0}^{-1}\, L_{(y),1} = T_{(y)}^{-1}\, Q_{1/0}\, T_{(y)}. \]
The disappearance of $T'_{(y)}$ is due to commutativity of multiplication operators. By the same logic as in Step 1, this expression together with Restriction 8 (ii) admits unique left eigenvalue-eigenfunction decomposition. Moreover, the point spectrum is exactly the same as the one in Step 1, as is the middle multiplication operator $Q_{1/0}$. This equivalence of the spectrum allows consistent ordering of $U$ with that of Step 1. Left eigenfunctions yield the kernel of $T_{(y)}$ pinned down by the normalization of unit integral. This shows that the operator $T_{(y)}$ is uniquely determined by the observed joint distribution $F_{Y_{\tau+2} \cdots Y_1 Z D_{\tau+1} \cdots D_1}(\cdots, \cdot, (1))$.
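For completeness, the operator algebra behind the product formed in Step 3 can be written out as follows. The display uses only the decomposition of $L_{(y),z}$ through $T'_{(y)} T_{(y)}$ and the commutativity of the multiplication operators $Q_0$, $Q_1$, and $T'_{(y)}$ stated above; it is a sketch of the intermediate steps rather than a new result.

```latex
\begin{align*}
L_{(y),0}^{-1} L_{(y),1}
  &= \bigl(P_{(y)} Q_0 T'_{(y)} T_{(y)}\bigr)^{-1}
     \bigl(P_{(y)} Q_1 T'_{(y)} T_{(y)}\bigr) \\
  &= T_{(y)}^{-1} \bigl(T'_{(y)}\bigr)^{-1} Q_0^{-1} P_{(y)}^{-1}
     P_{(y)} Q_1 T'_{(y)} T_{(y)} \\
  &= T_{(y)}^{-1} \bigl(T'_{(y)}\bigr)^{-1} Q_0^{-1} Q_1 T'_{(y)} T_{(y)} \\
  &= T_{(y)}^{-1} Q_1 Q_0^{-1} \bigl(T'_{(y)}\bigr)^{-1} T'_{(y)} T_{(y)}
     && \text{(multiplication operators commute)} \\
  &= T_{(y)}^{-1} Q_{1/0} T_{(y)}.
\end{align*}
```

The same cancellation, applied on the other side, gives the product used in Step 1, where $R_{(y)} S_{(y)}$ drops out instead of $P_{(y)}$.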
Step 4: Uniqueness of $F_{Y_{\tau+1} \cdots Y_2 U D_{\tau+1} \cdots D_1}(\cdots, \cdot, (1))$

Equation (27) implies that
\[ \int f_{Y_1 \mid Y_{\tau+1} \cdots Y_2 U D_{\tau+1} \cdots D_1}(y_1 \mid (y), u, (1)) \cdot f_{Y_{\tau+1} \cdots Y_2 U D_{\tau+1} \cdots D_1}((y), u, (1))\,du = f_{Y_{\tau+1} \cdots Y_1 D_{\tau+1} \cdots D_1}((y), y_1, (1)), \]
hence yielding the linear operator equation
\[ T^*_{(y)}\, f_{Y_{\tau+1} \cdots Y_2 U D_{\tau+1} \cdots D_1}((y), \cdot, (1)) = f_{Y_{\tau+1} \cdots Y_1 D_{\tau+1} \cdots D_1}((y), \cdot, (1)) \]
where $T^*_{(y)}$ denotes the adjoint operator of $T_{(y)}$. Since $T_{(y)}$ is invertible, so is its adjoint operator $T^*_{(y)}$. But then, the multiplier of the multiplication operator $T'_{(y)}$ can be given by the unique solution to the above linear operator equation, i.e.,
\[ f_{Y_{\tau+1} \cdots Y_2 U D_{\tau+1} \cdots D_1}((y), \cdot, (1)) = (T^*_{(y)})^{-1}\, f_{Y_{\tau+1} \cdots Y_1 D_{\tau+1} \cdots D_1}((y), \cdot, (1)). \]
$T_{(y)}$, hence $T^*_{(y)}$, was shown to be uniquely determined by $F_{Y_{\tau+2} \cdots Y_1 Z D_{\tau+1} \cdots D_1}(\cdots, \cdot, (1))$ in Step 3, and $f_{Y_{\tau+1} \cdots Y_1 D_{\tau+1} \cdots D_1}(\cdots, (1))$ is also available from observed data. Therefore, the operator $T'_{(y)}$ is uniquely determined by $F_{Y_{\tau+2} \cdots Y_1 Z D_{\tau+1} \cdots D_1}(\cdots, \cdot, (1))$.

Step 5: Uniqueness of $F_{D_{\tau+1} \mid Y_{\tau+1} \cdots Y_2 U}(1 \mid \cdots, \cdot)$

First, the definition of the operators $R_{(y)}$, $S_{(y)}$, $T_{(y)}$, and $T'_{(y)}$ and Lemma 8 (iii) yield the operator equality $R_{(y)} S_{(y)} = T'_{(y)} T_{(y)}$, where $T_{(y)}$ and $T'_{(y)}$ have been shown to be uniquely determined by the observed joint distribution $F_{Y_{\tau+2} \cdots Y_1 Z D_{\tau+1} \cdots D_1}(\cdots, \cdot, (1))$ in Steps 3 and 4, respectively. Recall that $S_{(y)}$ was also shown in Step 2 to be uniquely determined by the observed joint distributions $F_{Y_{\tau+2} \cdots Y_1 Z D_{\tau+1} \cdots D_1}(\cdots, \cdot, (1))$ and $f_{Y_{\tau+1} \cdots Y_1 D_\tau \cdots D_1}(\cdots, (1))$. Restriction 8 (iv) guarantees invertibility of $S_{(y)}$. It follows that the operator inversion
\[ R_{(y)} = (R_{(y)} S_{(y)})\, S_{(y)}^{-1} = (T'_{(y)} T_{(y)})\, S_{(y)}^{-1} \]
yields the operator $R_{(y)}$, in turn showing that its multiplier $f_{D_{\tau+1} \mid Y_{\tau+1} \cdots Y_2 U}(1 \mid (y), \cdot)$ is uniquely determined for each $(y)$ by the observed joint distributions $F_{Y_{\tau+2} \cdots Y_1 Z D_{\tau+1} \cdots D_1}(\cdots, \cdot, (1))$ and $f_{Y_{\tau+1} \cdots Y_1 D_\tau \cdots D_1}(\cdots, (1))$.
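The inversion of the adjoint equation in Step 4 has a simple finite-support analogue, sketched below in Python. The matrix `T`, the vector `w_true`, and the function name `solve_adjoint_equation` are hypothetical and serve only to illustrate the step: with matrices, the adjoint is the transpose, and the multiplier of the auxiliary multiplication operator is recovered by solving one linear system.

```python
import numpy as np


def solve_adjoint_equation(T, g):
    """Solve T* w = g for w, where T* is the transpose of the matrix T."""
    return np.linalg.solve(T.T, g)


# Hypothetical example: row u of T is a conditional density over y1 given type u.
T = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
w_true = np.array([0.20, 0.15, 0.05])   # multiplier values, one per latent type
g = T.T @ w_true                        # observed density over y1 (mixture over types)

w_hat = solve_adjoint_equation(T, g)    # recovers w_true because T is invertible
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical convenience; the identification argument itself only needs invertibility of the adjoint.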
Step 6: Uniqueness of FYτ ···Y1 U Dτ −1 ···D1 (· · · , · , (1)) Recall from Step 2 that fYτ +1 ···Y1 U Dτ ···D1 (· · · , · , (1)) is uniquely determined by the ob- served joint distributions FYτ +2 ···Y1 ZDτ +1 ···D1 (· · · , · , (1)) and fYτ +1 ···Y1 Dτ ···D1 (· · · , (1)). We can write fYτ +1 ···Y1 U Dτ ···D1 (y ′ , (y), u, (1)) = fYτ +1 |Yτ ···Y1 U Dτ ···D1 (y ′ | (y), u, (1)) ×fDτ |Yτ ···Y1 U Dτ −1 ···D1 (1 | (y), u, (1)) ×fYτ ···Y1 U Dτ −1 ···D1 ((y), u, (1)) = fYτ +1 |Yτ ···Y1 U (y ′ | (y), u) · fDτ |Yτ ···Y1 U (1 | (y), u) ×fYτ ···Y1 U Dτ −1 ···D1 ((y), u, (1)) = fYτ +2 |Yτ +1 ···Y2 U (y ′ | (y), u) · fDτ +1 |Yτ +1 ···Y2 U (1 | (y), u) ×fYτ ···Y1 U Dτ −1 ···D1 ((y), u, (1)), where the second equality follows from Lemma 8 (ii), and the third equality follows from Lemma 9 (i) and (ii). For a given ((y), u), there must exist some y ′ such that fYτ +2 |Yτ +1 ···Y2 U (y ′ | (y), u) > 0 by a property of conditional density functions. Moreover, Restriction 8 (iii) requires that fDτ +1 |Yτ +1 ···Y2 U (1 | (y), u) > 0 for a given 88 (y) for all u. Therefore, for such a choice of y ′ , we can write fYτ +1 ···Y1 U Dτ ···D1 (y ′ , (y), u, (1)) fYτ ···Y1 U Dτ −1 ···D1 ((y), u, (1)) = fYτ +2 |Yτ +1 ···Y2 U (y ′ | (y), u) · fDτ +1 |Yτ +1 ···Y2 U (1 | (y), u) fYτ +2 |Yτ +1 ···Y2 U (y ′ | (y), u) was shown in Step 1 to be uniquely determined by the ob- served joint distribution FYτ +2 ···Y1 ZDτ +1 ···D1 (· · · , · , (1)), fYτ +1 ···Y1 U Dτ ···D1 (y ′ , (y), u, (1)) was shown in Step 2 to be uniquely determined by the observed joint distributions FYτ +2 ···Y1 ZDτ +1 ···D1 (· · · , · , (1)) and fYτ +1 ···Y1 Dτ ···D1 (· · · , (1)), and fDτ +1 |Yτ +1 ···Y2 U (1 | (y), u) was shown in Step 5 to be uniquely determined by the observed joint distribu- tions FYτ +2 ···Y1 ZDτ +1 ···D1 (· · · , · , (1)) and fYτ +1 ···Y1 Dτ ···D1 (· · · , (1)). 
Therefore, it follows that the joint density $f_{Y_\tau \cdots Y_1 U D_{\tau-1} \cdots D_1}(\cdots, \cdot, (1))$ is uniquely determined by the observed joint distributions $F_{Y_{\tau+2} \cdots Y_1 Z D_{\tau+1} \cdots D_1}(\cdots, \cdot, (1))$ and $f_{Y_{\tau+1} \cdots Y_1 D_\tau \cdots D_1}(\cdots, (1))$. □

10.4. Models with Time-Specific Effects. The baseline model (1) that we considered in this paper assumes that the dynamic model $g$ is time-invariant. It is often more realistic to allow this model to have time-specific effects. Consider the following variant of the model (1).
\[
\begin{cases}
Y_t = g_t(Y_{t-1}, U, E_t) \quad t = 2, \cdots, T & \text{(State Dynamics)} \\
D_t = h(Y_t, U, V_t) \quad t = 1, \cdots, T - 1 & \text{(Hazard Model)} \\
F_{Y_1 U} & \text{(Initial joint distribution of } (Y_1, U)\text{)} \\
Z = \zeta(U, W) & \text{(Optional: nonclassical proxy of } U\text{)}
\end{cases}
\]
The difference from (1) is the subscript $t$ on $g$. The objective is to identify the model $(\{g_t\}_{t=2}^{T}, h, F_{Y_1 U}, \zeta)$. The main obstacle is that the invariant transition of Lemma 4 (i) is no longer useful. As a result, Steps 2 and 6 in the proof of Lemma 2 break down. To close this gap, we need to observe data of an additional time period prior to the start of the data, i.e., $t = 0$. For brevity, we show this result for the case of $T = 3$.

Lemma 11 (Identification). Suppose that Restrictions 1, 2, 3, and 4 hold conditionally on $D_0 = 1$. Then the model $(\{F_{Y_t \mid Y_{t-1} U}\}_{t=2}^{3}, F_{D_t \mid Y_t U}, F_{Y_1 U \mid D_0 = 1}, F_{Z \mid U})$ is uniquely determined by the observed joint distributions $F_{Y_1 Y_0 Z D_0}(\cdot,\cdot,\cdot,1)$, $F_{Y_2 Y_1 Y_0 Z D_1 D_0}(\cdot,\cdot,\cdot,\cdot,1,1)$, and $F_{Y_3 Y_2 Y_1 Y_0 Z D_2 D_1 D_0}(\cdot,\cdot,\cdot,\cdot,\cdot,1,1,1)$.

Proof. Many parts of the proof of Lemma 2 remain applicable. However, under the current model with time-specific transition, the operator $P_y$ is time-specific.
Therefore, we use two operators $P_y : L^2(F_U) \to L^2(F_{Y_t})$ and $P'_y : L^2(F_U) \to L^2(F_{Y_t})$ for each $y$, defined as
\begin{align*}
(P_y \xi)(y_2) &= \int f_{Y_2 \mid Y_1 U}(y_2 \mid y, u) \cdot \xi(u)\, du, \\
(P'_y \xi)(y_3) &= \int f_{Y_3 \mid Y_2 U}(y_3 \mid y, u) \cdot \xi(u)\, du.
\end{align*}
Accordingly, we employ the two observable operators $L_{y,z} : L^2(F_{Y_t}) \to L^2(F_{Y_t})$ and $L'_{y,z} : L^2(F_{Y_t}) \to L^2(F_{Y_t})$ for each $(y, z)$, defined as
\begin{align*}
(L_{y,z} \xi)(y_2) &= \int f_{Y_2 Y_1 Y_0 Z D_1 D_0}(y_2, y, y_0, z, 1, 1) \cdot \xi(y_0)\, dy_0, \\
(L'_{y,z} \xi)(y_3) &= \int f_{Y_3 Y_2 Y_1 Z D_2 D_1 D_0}(y_3, y, y_1, z, 1, 1, 1) \cdot \xi(y_1)\, dy_1.
\end{align*}
All the other operators carry over directly from the proof of Lemma 2 as
\begin{align*}
(Q_z \xi)(u) &= f_{Z \mid U}(z \mid u) \cdot \xi(u), \\
(R_y \xi)(u) &= f_{D_2 \mid Y_2 U}(1 \mid y, u) \cdot \xi(u), \\
(S_y \xi)(u) &= \int f_{Y_2 Y_1 U D_1 D_0}(y, y_1, u, 1, 1) \cdot \xi(y_1)\, dy_1, \\
(T_y \xi)(u) &= \int f_{Y_1 \mid Y_2 U D_2 D_1 D_0}(y_1 \mid y, u, 1, 1, 1) \cdot \xi(y_1)\, dy_1, \\
(T'_y \xi)(u) &= f_{Y_2 U D_2 D_1 D_0}(y, u, 1, 1, 1) \cdot \xi(u),
\end{align*}
except that the additional argument $D_0 = 1$ is attached to the kernels of $S_y$ and $T_y$ and to the multiplier of $T'_y$.

The first task is to identify the kernels of the two integral operators $P_y$ and $P'_y$. Following Step 1 of the proof of Lemma 2 with the observed operator $L'_{y,z}$ shows that $P'_y$ and $Q_z$ are identified. Equivalently, $F_{Y_3 \mid Y_2 U}$ and $F_{Z \mid U}$ are identified. Similarly, following Step 1 with the observed operator $L_{y,z}$ shows that $P_y$ and $Q_z$ are identified. Equivalently, $F_{Y_2 \mid Y_1 U}$ is identified as well.

Next, follow Step 2 of the proof of Lemma 2, except that we use our current definition of $P_y$ instead of $P'_y$. It follows that
\[
f_{Y_2 Y_1 U D_1 D_0}(y', y, u, 1, 1) = f_{Y_2 \mid Y_1 U}(y' \mid y, u) \cdot [P_y^{-1} f_{Y_2 Y_1 D_1 D_0}(\,\cdot\,, y, 1, 1)](u),
\]
where $f_{Y_2 \mid Y_1 U}$ was identified as the kernel of $P_y$ in the previous step, $P_y$ was identified in the previous step, and $f_{Y_2 Y_1 D_1 D_0}(\cdot, \cdot, 1, 1)$ is observable from data. This shows that the operator $S_y$ is identified for each $y$.

Steps 3–5 follow analogously from the proof of Lemma 2, except that the current definitions of $L_{y,z}$, $R_y$, $S_y$, $T_y$, and $T'_y$ are used. These steps show in particular that $R_y$ is identified for each $y$.
Lastly, extending the argument of Step 6 in the proof of Lemma 2 yields
\[
f_{Y_1 U \mid D_0}(y, u \mid 1) = \frac{f_{Y_2 Y_1 U D_1 D_0}(y', y, u, 1, 1)}{f_{Y_2 \mid Y_1 U}(y' \mid y, u)\, f_{D_2 \mid Y_2 U}(1 \mid y, u)\, f_{D_0}(1)},
\]
where $f_{Y_2 Y_1 U D_1 D_0}(\cdot, \cdot, \cdot, 1, 1)$ was identified in the second step, $f_{Y_2 \mid Y_1 U}$ was identified in the first step, $f_{D_2 \mid Y_2 U}$ was identified in the previous step, and $f_{D_0}(1)$ is observable from data. It follows that $F_{Y_1 U \mid D_0 = 1}$ is identified. □

10.5. Censoring by Contemporaneous $D_t$ instead of Lagged $D_t$. For the main identification result, we assumed that the lagged selection indicator $D_t$ induces censored observation of $Y_t$ as follows:

observe $Y_1$,
observe $Y_2$ if $D_1 = 1$,
observe $Y_3$ if $D_1 = D_2 = 1$.

In many applications, the contemporaneous $D_t$ rather than the lagged $D_t$ may induce censored observation of $Y_t$ as follows:

observe $Y_1$ if $D_1 = 1$,
observe $Y_2$ if $D_1 = D_2 = 1$,
observe $Y_3$ if $D_1 = D_2 = D_3 = 1$,

where the model is a slight modification of (1):
\[
\begin{cases}
Y_t = g(Y_{t-1}, U, E_t), \quad t = 2, \cdots, T & \text{(State Dynamics)} \\
D_t = h(Y_t, U, V_t), \quad t = 1, \cdots, T & \text{(Hazard Model)} \\
F_{Y_1 U} & \text{(Initial joint distribution of $(Y_1, U)$)} \\
Z = \zeta(U, W) & \text{(Optional: nonclassical proxy of $U$)}
\end{cases}
\]
(The difference from the baseline model (1) is that the hazard model is defined for all $t = 1, \cdots, T$.) In this model, the identification problem is to show that the map
\[
(F_{Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, 1, 1),\; F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)) \mapsto (g, h, F_{Y_1 U \mid D_1 = 1}, \zeta)
\]
is well defined. First, consider the following auxiliary lemma, which can be proved similarly to Lemma 3.

Lemma 12 (Independence). The following implications hold:
(i) Restriction 2 (i) $\Rightarrow$ $Y_3 \perp\!\!\!\perp (Y_1, D_1, D_2, D_3, Z) \mid (Y_2, U)$.
(ii) Restriction 2 (i) $\Rightarrow$ $Y_2 \perp\!\!\!\perp (D_1, D_2, Z) \mid (Y_1, U)$.
(iii) Restriction 2 (ii) $\Rightarrow$ $D_3 \perp\!\!\!\perp Y_2 \mid (Y_3, U)$.
(iv) Restriction 2 (iii) $\Rightarrow$ $Z \perp\!\!\!\perp (Y_2, Y_1, D_3, D_2, D_1) \mid U$.

Some of the rank conditions of Restriction 3 are replaced as follows.

Restriction 9 (Rank Conditions).
The following conditions hold for every $y \in \mathcal{Y}$:
(i) Heterogeneous Dynamics: the integral operator $P_y : L^2(F_U) \to L^2(F_{Y_t})$ defined by $P_y \xi(y') = \int f_{Y_3 \mid Y_2 U}(y' \mid y, u) \cdot \xi(u)\, du$ is bounded and invertible.
(ii) Nondegenerate Proxy Model: there exists $\delta > 0$ such that $\delta \le f_{Z \mid U}(1 \mid u) \le 1 - \delta$ for all $u$. Relevant Proxy: $f_{Z \mid U}(1 \mid u) \ne f_{Z \mid U}(1 \mid u')$ whenever $u \ne u'$.
(iii) No Extinction: $f_{D_2 \mid Y_2 U}(1 \mid y, u) > 0$ for all $u \in \mathcal{U}$.
(iv) Initial Heterogeneity: the two integral operators $\tilde{L}_y : L^2(Y_t) \to L^2(U)$ and $S_y : L^2(U) \to L^2(Y_t)$, respectively defined by $\tilde{L}_y \xi(u) = \int f_{Y_2 Y_1 U D_3 D_2 D_1}(y, y_1, u, 1, 1, 1) \cdot \xi(y_1)\, dy_1$ and $S_y \xi(y_1) = \int f_{Y_2 Y_1 U D_2 D_1}(y, y_1, u, 1, 1) \cdot \xi(u)\, du$, are bounded and invertible.

Lemma 13 (Identification). Under Restrictions 1, 2, 4, and 9, the quadruple $(F_{Y_3 \mid Y_2 U}, F_{D_3 \mid Y_3 U}, F_{Y_1 U \mid D_1 = 1}, F_{Z \mid U})$ is uniquely determined by $F_{Y_2 Y_1 Z D_2 D_1}(\cdot, \cdot, \cdot, 1, 1)$ and $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$.

Proof. Given fixed $y$ and $z$, define the operators $L_{y,z} : L^2(F_{Y_t}) \to L^2(F_{Y_t})$, $P_y : L^2(F_U) \to L^2(F_{Y_t})$, $Q_z : L^2(F_U) \to L^2(F_U)$, $\tilde{L}_y : L^2(Y_t) \to L^2(U)$, and $S_y : L^2(U) \to L^2(Y_t)$ by
\begin{align*}
(L_{y,z} \xi)(y_3) &= \int f_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(y_3, y, y_1, z, 1, 1, 1) \cdot \xi(y_1)\, dy_1, \\
(P_y \xi)(y_3) &= \int f_{Y_3 \mid Y_2 U}(y_3 \mid y, u) \cdot \xi(u)\, du, \\
(Q_z \xi)(u) &= f_{Z \mid U}(z \mid u) \cdot \xi(u), \\
(\tilde{L}_y \xi)(u) &= \int f_{Y_2 Y_1 U D_3 D_2 D_1}(y, y_1, u, 1, 1, 1) \cdot \xi(y_1)\, dy_1, \\
(S_y \xi)(y_1) &= \int f_{Y_2 Y_1 U D_2 D_1}(y, y_1, u, 1, 1) \cdot \xi(u)\, du,
\end{align*}
respectively. Similarly to the proof of Lemma 2, the operator $L_{y,z}$ is identified from the observed joint distribution $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$.
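A finite-dimensional sketch may help fix ideas about how these operators are used. Assuming, purely for illustration, that $U$ and each $Y_t$ take three values, the operators become matrices, and the factorization $L_{y,z} = P_y Q_z \tilde{L}_y$ established in Step 1 turns identification of $P_y$ and the proxy odds into an eigendecomposition of $L_{y,1} L_{y,0}^{-1}$. All numerical values, the function name `decompose`, and the convention of labeling $u$ by increasing proxy odds are assumptions of this sketch, not part of the proof.

```python
import numpy as np

# Hypothetical discretization: U and each Y_t take K = 3 values.
# Column u of P holds f_{Y3|Y2,U}(. | y, u) for a fixed y, so each column sums to one.
P = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.3],
              [0.1, 0.2, 0.6]])

# Proxy model f_{Z|U}(1 | u): bounded away from 0 and 1 and distinct across u,
# mimicking Restriction 9 (ii).
p_z1 = np.array([0.2, 0.5, 0.8])
Q1, Q0 = np.diag(p_z1), np.diag(1.0 - p_z1)

# An arbitrary invertible matrix standing in for the kernel of L~_y
# (rows indexed by u, columns by y1).
L_tilde = np.array([[0.10, 0.05, 0.02],
                    [0.04, 0.12, 0.05],
                    [0.03, 0.06, 0.11]])

# "Observed" operators L_{y,z} = P_y Q_z L~_y for z = 1 and z = 0.
L1 = P @ Q1 @ L_tilde
L0 = P @ Q0 @ L_tilde


def decompose(L1, L0):
    """Recover (P, proxy odds) from L1 L0^{-1} = P Q_{1/0} P^{-1}."""
    eigvals, eigvecs = np.linalg.eig(L1 @ np.linalg.inv(L0))
    eigvals, eigvecs = np.real(eigvals), np.real(eigvecs)
    order = np.argsort(eigvals)  # labeling assumption: odds increase with u
    eigvecs = eigvecs[:, order]
    # Eigenvectors are determined up to scale; columns of P are densities,
    # so rescale each column to sum to one.
    P_hat = eigvecs / eigvecs.sum(axis=0, keepdims=True)
    return P_hat, eigvals[order]


P_hat, odds_hat = decompose(L1, L0)
print(np.round(P_hat, 6))
print(np.round(odds_hat, 6))
```

The eigenvalues play the role of the proxy odds $f_{Z \mid U}(1 \mid u)/f_{Z \mid U}(0 \mid u)$, and the normalized eigenvectors play the role of the kernel of $P_y$. Distinct odds (the Relevant Proxy condition) are what make the eigenvectors unique up to scale in this toy version.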
Step 1: Uniqueness of $F_{Y_3 \mid Y_2 U}$ and $F_{Z \mid U}$

The kernel $f_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\,\cdot\,, y, \,\cdot\,, z, 1, 1, 1)$ of the integral operator $L_{y,z}$ can be rewritten as
\begin{align*}
f_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(y_3, y, y_1, z, 1, 1, 1) = \int & f_{Y_3 \mid Y_2 Y_1 Z U D_3 D_2 D_1}(y_3 \mid y, y_1, z, u, 1, 1, 1) \\
& \times f_{Z \mid Y_2 Y_1 U D_3 D_2 D_1}(z \mid y, y_1, u, 1, 1, 1) \\
& \times f_{Y_2 Y_1 U D_3 D_2 D_1}(y, y_1, u, 1, 1, 1)\, du. \tag{28}
\end{align*}
But by Lemma 12 (i) and (iv) respectively, Restriction 2 implies that
\begin{align*}
f_{Y_3 \mid Y_2 Y_1 Z U D_3 D_2 D_1}(y_3 \mid y, y_1, z, u, 1, 1, 1) &= f_{Y_3 \mid Y_2 U}(y_3 \mid y, u), \\
f_{Z \mid Y_2 Y_1 U D_3 D_2 D_1}(z \mid y, y_1, u, 1, 1, 1) &= f_{Z \mid U}(z \mid u).
\end{align*}
Equation (28) thus can be rewritten as
\[
f_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(y_3, y, y_1, z, 1, 1, 1) = \int f_{Y_3 \mid Y_2 U}(y_3 \mid y, u) \cdot f_{Z \mid U}(z \mid u) \times f_{Y_2 Y_1 U D_3 D_2 D_1}(y, y_1, u, 1, 1, 1)\, du.
\]
But this implies that the integral operator $L_{y,z}$ is written as the operator composition
\[
L_{y,z} = P_y Q_z \tilde{L}_y.
\]
Restriction 9 (i), (ii), and (iv) imply that the operators $P_y$, $Q_z$, and $\tilde{L}_y$ are invertible, respectively. Hence so is $L_{y,z}$. Using the two values $\{0, 1\}$ of $Z$, form the product
\[
L_{y,1} L_{y,0}^{-1} = P_y Q_{1/0} P_y^{-1},
\]
where $Q_{z/z'} := Q_z Q_{z'}^{-1}$ is the multiplication operator with proxy odds defined by
\[
(Q_{1/0} \xi)(u) = \frac{f_{Z \mid U}(1 \mid u)}{f_{Z \mid U}(0 \mid u)}\, \xi(u).
\]
The rest of Step 1 is analogous to that of the proof of Lemma 2. Therefore, $F_{Y_3 \mid Y_2 U}$ and $F_{Z \mid U}$ are uniquely determined by the observed joint distribution $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$. Equivalently, the operators $P_y$ and $Q_z$ are uniquely determined for each $y$ and $z$, respectively.

Step 2: Uniqueness of $F_{Y_2 Y_1 U D_2 D_1}(\cdot, \cdot, \cdot, 1, 1)$

By Lemma 12 (ii), Restriction 2 implies $f_{Y_2 \mid Y_1 U D_2 D_1}(y' \mid y, u, 1, 1) = f_{Y_2 \mid Y_1 U}(y' \mid y, u)$. Using this equality, write the density of the observed joint distribution $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$ as
\begin{align*}
f_{Y_2 Y_1 D_2 D_1}(y', y, 1, 1) &= \int f_{Y_2 \mid Y_1 U D_2 D_1}(y' \mid y, u, 1, 1)\, f_{Y_1 U D_2 D_1}(y, u, 1, 1)\, du \\
&= \int f_{Y_2 \mid Y_1 U}(y' \mid y, u)\, f_{Y_1 U D_2 D_1}(y, u, 1, 1)\, du. \tag{29}
\end{align*}
By Lemma 4 (i), $F_{Y_3 \mid Y_2 U}(y' \mid y, u) = F_{Y_2 \mid Y_1 U}(y' \mid y, u)$ for all $y', y, u$.
Therefore, we can write the operator $P_y$ as
\[
(P_y \xi)(y') = \int f_{Y_3 \mid Y_2 U}(y' \mid y, u) \cdot \xi(u)\, du = \int f_{Y_2 \mid Y_1 U}(y' \mid y, u) \cdot \xi(u)\, du.
\]
With this operator notation, it follows from (29) that
\[
f_{Y_2 Y_1 D_2 D_1}(\,\cdot\,, y, 1, 1) = P_y f_{Y_1 U D_2 D_1}(y, \,\cdot\,, 1, 1).
\]
By Restriction 9 (i), this operator equation can be solved for $f_{Y_1 U D_2 D_1}(y, \,\cdot\,, 1, 1)$ as
\[
f_{Y_1 U D_2 D_1}(y, \,\cdot\,, 1, 1) = P_y^{-1} f_{Y_2 Y_1 D_2 D_1}(\,\cdot\,, y, 1, 1). \tag{30}
\]
Recall that $P_y$ was shown in Step 1 to be uniquely determined by the observed joint distribution $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$. The function $f_{Y_2 Y_1 D_2 D_1}(\,\cdot\,, y, 1, 1)$ is also uniquely determined by the observed joint distribution $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$ up to null sets. Therefore, (30) shows that $f_{Y_1 U D_2 D_1}(\cdot, \cdot, 1, 1)$ is uniquely determined by the observed joint distributions $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$.

Using the solution to the above inverse problem, we can write the kernel of the operator $S_y$ as
\begin{align*}
f_{Y_2 Y_1 U D_2 D_1}(y', y, u, 1, 1) &= f_{Y_2 \mid Y_1 U D_2 D_1}(y' \mid y, u, 1, 1) \cdot f_{Y_1 U D_2 D_1}(y, u, 1, 1) \\
&= f_{Y_2 \mid Y_1 U}(y' \mid y, u) \cdot f_{Y_1 U D_2 D_1}(y, u, 1, 1) \\
&= f_{Y_3 \mid Y_2 U}(y' \mid y, u) \cdot f_{Y_1 U D_2 D_1}(y, u, 1, 1) \\
&= f_{Y_3 \mid Y_2 U}(y' \mid y, u) \cdot [P_y^{-1} f_{Y_2 Y_1 D_2 D_1}(\,\cdot\,, y, 1, 1)](u),
\end{align*}
where the second equality follows from Lemma 12 (ii), the third equality follows from Lemma 4 (i), and the fourth equality follows from (30). Since $f_{Y_3 \mid Y_2 U}$ was shown in Step 1 to be uniquely determined by the observed joint distribution $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$, and $[P_y^{-1} f_{Y_2 Y_1 D_2 D_1}(\,\cdot\,, y, 1, 1)]$ was shown in the previous paragraph to be uniquely determined for each $y$ by the observed joint distributions $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$, it follows that $f_{Y_2 Y_1 U D_2 D_1}(\cdot, \cdot, \cdot, 1, 1)$ too is uniquely determined by the observed joint distributions $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$.
Equivalently, the operator $S_y$ is identified for each $y$.

Step 3: Uniqueness of $F_{Y_3 D_3 \mid Y_2 U}(\,\cdot\,, 1 \mid \cdot\,, \cdot\,)$

The density of the observed joint distribution $F_{Y_3 Y_2 Y_1 D_3 D_2 D_1}(y_3, y_2, y_1, 1, 1, 1)$ can be decomposed as
\begin{align*}
f_{Y_3 Y_2 Y_1 D_3 D_2 D_1}(y_3, y_2, \,\cdot\,, 1, 1, 1) &= \int f_{Y_3 D_3 \mid Y_2 Y_1 U D_2 D_1}(y_3, 1 \mid y_2, y_1, u, 1, 1) \times f_{Y_2 Y_1 U D_2 D_1}(y_2, y_1, u, 1, 1)\, du \\
&= \int f_{Y_3 D_3 \mid Y_2 U}(y_3, 1 \mid y_2, u) \cdot f_{Y_2 Y_1 U D_2 D_1}(y_2, y_1, u, 1, 1)\, du \\
&= S_{y_2} f_{Y_3 D_3 \mid Y_2 U}(y_3, 1 \mid y_2, \,\cdot\,)
\end{align*}
for each $y_3$ and $y_2$, where the second equality follows from Lemma 12 (i) and (iii). By Restriction 9 (iv), $S_{y_2}$ is invertible, and we can rewrite the above equality as
\[
f_{Y_3 D_3 \mid Y_2 U}(y_3, 1 \mid y_2, \,\cdot\,) = S_{y_2}^{-1} f_{Y_3 Y_2 Y_1 D_3 D_2 D_1}(y_3, y_2, \,\cdot\,, 1, 1, 1).
\]
Recall that $S_{y_2}$ was shown in Step 2 to be uniquely determined by the observed joint distributions $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$. Therefore, $F_{Y_3 D_3 \mid Y_2 U}(\,\cdot\,, 1 \mid \cdot\,, \cdot\,)$ is identified by the observed joint distributions $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$.

Step 4: Uniqueness of $F_{D_3 \mid Y_3 U}(1 \mid \cdot\,, \cdot\,)$

The density of the distribution $F_{Y_3 D_3 \mid Y_2 U}(y_3, 1 \mid y_2, u)$ identified in Step 3 can be decomposed as
\begin{align*}
f_{Y_3 D_3 \mid Y_2 U}(y_3, 1 \mid y_2, u) &= f_{D_3 \mid Y_3 Y_2 U}(1 \mid y_3, y_2, u) \cdot f_{Y_3 \mid Y_2 U}(y_3 \mid y_2, u) \\
&= f_{D_3 \mid Y_3 U}(1 \mid y_3, u) \cdot f_{Y_3 \mid Y_2 U}(y_3 \mid y_2, u),
\end{align*}
where the second equality follows from Lemma 12 (iii). For each pair $(y_3, u)$ in the support, there exists $y_2$ such that $f_{Y_3 \mid Y_2 U}(y_3 \mid y_2, u) > 0$. For such $y_2$, rewrite the above equation as
\[
f_{D_3 \mid Y_3 U}(1 \mid y_3, u) = \frac{f_{Y_3 D_3 \mid Y_2 U}(y_3, 1 \mid y_2, u)}{f_{Y_3 \mid Y_2 U}(y_3 \mid y_2, u)}.
\]
Recall that Step 1 showed that $F_{Y_3 \mid Y_2 U}$ is uniquely determined by the observed joint distribution $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$, and Step 3 showed that $F_{Y_3 D_3 \mid Y_2 U}(\,\cdot\,, 1 \mid \cdot\,, \cdot\,)$ is identified by the observed joint distributions $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$.
Therefore, $F_{D_3 \mid Y_3 U}(1 \mid \cdot\,, \cdot\,)$ is identified by the observed joint distributions $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$.

Step 5: Uniqueness of $F_{Y_1 U \mid D_1 = 1}$

Recall from Step 2 that $f_{Y_2 Y_1 U D_2 D_1}(\cdot, \cdot, \cdot, 1, 1)$ is uniquely determined by the observed joint distributions $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$. We can write
\begin{align*}
f_{Y_2 Y_1 U D_2 D_1}(y', y, u, 1, 1) &= f_{D_2 \mid Y_2 Y_1 U D_1}(1 \mid y', y, u, 1)\, f_{Y_2 \mid Y_1 U D_1}(y' \mid y, u, 1)\, f_{Y_1 U D_1}(y, u, 1) \\
&= f_{D_2 \mid Y_2 U}(1 \mid y', u)\, f_{Y_2 \mid Y_1 U}(y' \mid y, u)\, f_{Y_1 U D_1}(y, u, 1) \\
&= f_{D_3 \mid Y_3 U}(1 \mid y', u)\, f_{Y_3 \mid Y_2 U}(y' \mid y, u)\, f_{Y_1 U D_1}(y, u, 1),
\end{align*}
where the second equality follows from Lemma 12 (ii), and the third equality follows from Lemma 4 (i) and (ii). For a given $(y, u)$, there must exist some $y'$ such that $f_{Y_3 \mid Y_2 U}(y' \mid y, u) > 0$ by a property of conditional density functions. Moreover, Restriction 9 (iii) requires that $f_{D_3 \mid Y_3 U}(1 \mid y', u) > 0$ for a given $y'$ for all $u$. Therefore, for such a choice of $y'$, we can write
\[
f_{Y_1 U D_1}(y, u, 1) = \frac{f_{Y_2 Y_1 U D_2 D_1}(y', y, u, 1, 1)}{f_{Y_3 \mid Y_2 U}(y' \mid y, u)\, f_{D_3 \mid Y_3 U}(1 \mid y', u)}.
\]
Recall that $f_{Y_3 \mid Y_2 U}(\,\cdot \mid \cdot\,, \cdot\,)$ was shown in Step 1 to be uniquely determined by the observed joint distribution $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$, $f_{Y_2 Y_1 U D_2 D_1}(\cdot, \cdot, \cdot, 1, 1)$ was shown in Step 2 to be uniquely determined by the observed joint distributions $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$, and $f_{D_3 \mid Y_3 U}(1 \mid \cdot\,, \cdot\,)$ was shown in Step 4 to be uniquely determined by the observed joint distributions $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$. Therefore, it follows that the initial joint density $f_{Y_1 U \mid D_1 = 1}$ is uniquely determined by the observed joint distributions $F_{Y_3 Y_2 Y_1 Z D_3 D_2 D_1}(\cdot, \cdot, \cdot, \cdot, 1, 1, 1)$ and $F_{Y_2 Y_1 D_2 D_1}(\cdot, \cdot, 1, 1)$. □

CHAPTER 2

Structural Partial Effects

1.
Introduction

Economists are often interested in the structural partial effect $\beta$ of models of the form
\[
Y = \alpha + \beta X + E, \qquad X \not\perp\!\!\!\perp E.
\]
In this constant-coefficient affine structure, the local instrumental variable (LIV) defined by
\[
\frac{\frac{d}{dz} E[Y \mid Z = z]}{\frac{d}{dz} E[X \mid Z = z]}
\]
at any point $Z = z$ of an instrumental variable identifies this structural parameter $\beta$ even if the first stage is nonparametric and nonseparable. It reduces to two-stage least squares when the first stage is a constant-coefficient affine model. More generally, economists are interested in the structural partial effect $\beta(X, E) := \frac{\partial}{\partial x} g(X, E)$ of nonparametric and nonseparable structural models of the form
\[
Y = g(X, E), \qquad X \not\perp\!\!\!\perp E.
\]
This paper shows that the LIV continues to identify the structural partial effect even in this nonparametric framework under certain first-stage restrictions. Moreover, we generalize the LIV identification methods to accommodate a more general class of first-stage models.

An identification concept for nonseparable models under exogeneity and monotonicity is discussed by Matzkin (2003). Hoderlein and Mammen (2007) discuss what can be identified without monotonicity. Chesher (2003) identifies structural partial effects of nonseparable models under endogeneity. Meanwhile, control variable approaches have been proposed as ways to turn endogeneity into conditional exogeneity (Altonji and Matzkin, 2005; Imbens and Newey, 2009). Identification of quantile structural functions under endogeneity is studied by Chernozhukov and Hansen (2005), Chernozhukov, Imbens and Newey (2007), Torgovitsky (2011), and D'Haultfoeuille and Février (2011). Our work is perhaps most closely related to Chesher (2003), who specifically identifies structural partial effects for well-defined subpopulations of economic agents.
Because the statistical parameter (two-stage least squares) coincides with the structural partial effect in the classical affine regression models, it is worth starting with this classical idea. We explore directions in which this classical idea can be extended to nonparametric and nonseparable models. Heckman and Vytlacil (1999, 2005, 2007) demonstrate that the localized version of the IV estimator (LIV) identifies structural causal effects under nonparametric binary treatment models. In Section 2, we show that the same statistical object can be used to identify average structural partial effects for a class of nonseparable and nonparametric models. In Section 3, we further extend this idea to identification of the general marginal treatment effect (MTE) projected on subpopulations characterized by all observed variables, i.e., $E[\beta(X, E) \mid Y, X, Z]$. The identifying statistical parameter of a special case of the MTE corresponds to a well-known formula previously proposed in the literature (Chesher, 2003; Imbens and Newey, 2009); see Section 4.2.

2. The Local Instrumental Variable Estimator

Recall the classical two-stage constant-coefficient affine structures of the form
\[
\begin{cases}
Y = \alpha + \beta X + E \\
X = \gamma + \delta Z + U
\end{cases}
\qquad \text{where } E[(E, U) \mid Z] = (0, 0). \tag{31}
\]
The structural parameter $\beta$ is identified by the ratio of the two reduced-form mean regressions
\[
\frac{\frac{d}{dz} E[Y \mid Z = z]}{\frac{d}{dz} E[X \mid Z = z]}. \tag{32}
\]
A sample analog of this fraction is nothing but the two-stage least-squares estimator of $\beta$.¹

¹ Recall that the two-stage least squares is $\beta_{2SLS} = e_2' E[Z'X]^{-1} E[Z'Y] = \mathrm{Cov}(Z, Y)/\mathrm{Cov}(Z, X) = [\mathrm{Cov}(Z, Y)/\mathrm{Var}(Z)]/[\mathrm{Cov}(Z, X)/\mathrm{Var}(Z)] = \frac{d}{dz} E[Y \mid Z = z] / \frac{d}{dz} E[X \mid Z = z]$. The numerator $\frac{d}{dz} E[Y \mid Z = z]$ identifies the reduced-form composite parameter $\beta\delta$ and the denominator $\frac{d}{dz} E[X \mid Z = z]$ identifies the reduced-form first-stage parameter $\delta$ under the constant-coefficient affine model (31).

The fraction (32) continues to identify a variety of causal effects in another important class of models. Heckman and Vytlacil (1999, 2005, 2007) consider a nonparametric binary treatment model of the form
\[
\begin{cases}
Y = g(X, E) \\
X = 1\{h(Z) > U\}
\end{cases}
\qquad \text{where } Z \perp\!\!\!\perp (E, U). \tag{33}
\]
Despite the apparent discrepancy between the models (31) and (33), they show that the same fraction (32) still identifies the marginal treatment effect $E[g(1, E) - g(0, E) \mid U]$ under (33), which in turn can be used to recover various treatment parameters. Heckman and Vytlacil call (32) the local instrumental variable (LIV) estimator.²

² They define the LIV as the derivative $dE[Y \mid P = p]/dp$ where $P = E[X \mid Z]$, but it is equivalent to (32).

Given that the LIV identifies causal effects in both the parametric continuous treatment model (31) and the nonparametric binary treatment model (33), a natural question is: to what extent can we nonparametrically generalize (31) while keeping the LIV capable of identifying causal effects? This question is practically important because the LIV formula (32) is a natural generalization of conventional two-stage least squares, which certainly works for the classical model (31), whereas the true model may be more general than (31).

To answer this question, consider a class of nonparametric and nonseparable two-stage structures of the form
\[
\begin{cases}
Y = g(X, E) \\
X = h(Z, U)
\end{cases}
\qquad \text{where } Z \perp\!\!\!\perp (E, U). \tag{34}
\]
We define the partial effect by $\beta(x, \varepsilon) := \frac{\partial}{\partial x} g(x, \varepsilon)$. It is straightforward to see that the local average partial effect $E[\beta(X, E) \mid Z = z]$ can be identified by the LIV formula (32) under nonparametric regression models. We provide an exact condition under which this LIV identification result continues to hold under the nonseparable model (34).

Assumption 6 (Local Rank Condition for LIV). $\frac{d}{dz} E[X \mid Z = z] \ne 0$.

Assumption 7 (Stochastically Separable First Stage). $\mathrm{Cov}\left(\beta(h(z, U), E),\; \frac{\partial}{\partial z} h(z, U)\right) = 0$.
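A small worked check may make the role of Assumption 7 concrete. The sketch below assumes a hypothetical second stage $g(x, e) = e \cdot x$, so that $\beta(X, E) = E$, and compares two hypothetical first stages: a separable one, $h(z, u) = z + u$, and a nonseparable one, $h(z, u) = z \cdot u$. With $Z$ independent of $(E, U)$, both derivatives in (32) have closed forms in the moments of $(E, U)$. The function names and moment values are illustrative assumptions, not part of the original analysis.

```python
def liv_separable(mean_e, mean_u, cov_eu):
    """LIV ratio (32) when Y = E*X and X = Z + U (hypothetical design).

    E[Y | Z=z] = z*E[E] + E[E*U], so d/dz E[Y | Z=z] = E[E];
    E[X | Z=z] = z + E[U],        so d/dz E[X | Z=z] = 1.
    """
    return mean_e / 1.0


def liv_nonseparable(mean_e, mean_u, cov_eu):
    """LIV ratio (32) when Y = E*X and X = Z*U (hypothetical design).

    E[Y | Z=z] = z*E[E*U], so d/dz E[Y | Z=z] = E[E*U];
    E[X | Z=z] = z*E[U],   so d/dz E[X | Z=z] = E[U], nonzero by Assumption 6.
    """
    mean_eu = cov_eu + mean_e * mean_u  # E[E*U] = Cov(E, U) + E[E]*E[U]
    return mean_eu / mean_u


# Hypothetical moments with endogeneity: Cov(E, U) != 0.
mean_e, mean_u, cov_eu = 1.0, 2.0, 0.5
target = mean_e  # E[beta(X, E) | Z=z] = E[E] in both designs

gap_separable = liv_separable(mean_e, mean_u, cov_eu) - target
gap_nonseparable = liv_nonseparable(mean_e, mean_u, cov_eu) - target

# In the nonseparable design dh/dz = U, so Cov(beta, dh/dz) = Cov(E, U),
# and the LIV misses the target by Cov(E, U) / E[U].
print(gap_separable, gap_nonseparable)
```

In the separable design $\partial h/\partial z$ is nonrandom, the covariance in Assumption 7 vanishes, and the LIV hits the target. In the nonseparable design the covariance equals $\mathrm{Cov}(E, U)$ and the gap is $\mathrm{Cov}(E, U)/E[U]$.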
Assumption 6 is a generalization of the conventional rank condition $\delta \ne 0$ under the classical model (31). In the case of endogeneity, Assumption 7 generally amounts to a separable first-stage model $h(z, u) = \mu(z) + u$, unless the second-stage model takes the specific form of a constant-coefficient affine model $g(x, \varepsilon) = \alpha + \beta x + \varepsilon$ as in (31). The following theorem states that this assumption is essential for the LIV formula (32) to identify the local average partial effect $E[\beta(X, E) \mid Z = z]$.

Theorem 2 (LIV: Necessary and Sufficient Condition). Suppose that Assumption 6 is satisfied for the model (34).³ Then,
\[
E[\beta(X, E) \mid Z = z] = \frac{\frac{d}{dz} E[Y \mid Z = z]}{\frac{d}{dz} E[X \mid Z = z]}
\]
holds if and only if Assumption 7 is true.

³ In addition, we also assume the following regularity conditions: $g$ and $h$ are continuously differentiable with respect to their first arguments; and $\beta(h(z, \cdot), \cdot)$ is dominated in absolute value by an $L^1(F_{EU})$ function.

If we allow the second-stage structural function $g$ to take more arbitrary forms than the simple affine model (31), this necessary and sufficient condition generally amounts to a separable first stage, $h(z, u) = \mu(z) + u$, whose representative example is of course the nonparametric mean regression model. Therefore, given the result of Theorem 2, we hereafter discuss the LIV within the framework of the following assumption.

Assumption 8 (Separable First Stage). $h(z, u) = \mu(z) + u$.

Recall that the two-stage least squares identifying the structural parameter $\beta$ under the classical model (31) can be interpreted as the coefficient in the regression of $Y$ on the predicted value of $X$, i.e., the partial effect of $\delta Z$ on $Y$. This interpretation carries over to the nonparametric model (34) under Assumption 8. That is, the structural partial effect $E[\beta(X, E) \mid Z = z]$ can be interpreted as, and identified by, the partial effect of $\mu(Z)$ on $Y$.

Theorem 3 (Mean and Quantile LIV).
Suppose that Assumptions 6 and 8 are satisfied for the model (34).⁴ With the notation $P := \mu(Z)$, the following equalities hold.
\begin{align*}
\text{(i)} \quad & E[\beta(X, E) \mid Z = z] = \left.\frac{d}{dp} E[Y \mid P = p]\right|_{p = \mu(z)} \quad \text{and} \\
\text{(ii)} \quad & E[\beta(X, E) \mid Y = Q_{Y \mid P}(\tau \mid \mu(z)), Z = z] = \left.\frac{\partial}{\partial p} Q_{Y \mid P}(\tau \mid p)\right|_{p = \mu(z)},
\end{align*}
where $Q_{Y \mid P}(\tau \mid p) := \inf\{y \mid F_{Y \mid P}(y \mid p) \ge \tau\}$ denotes the $\tau$-th quantile regression of $Y$ on $P$.

⁴ In addition, we also assume the following regularity conditions: $g$ is continuously differentiable with respect to $x$; $F_{Y \mid P}$ is continuously differentiable with respect to $y$; $Q_{Y \mid P}$ is continuously differentiable with respect to $p$; $f_{Y \mid P}$ is continuous in $p$; and $\beta(h(z, \cdot), \cdot)$ is dominated in absolute value by an $L^1(F_{EU})$ function.

Combining the identifying equality of Theorem 2 with Theorem 3 (i), we have
\[
E[\beta(X, E) \mid Z = z] = \underbrace{\frac{\frac{d}{dz} E[Y \mid Z = z]}{\frac{d}{dz} E[X \mid Z = z]}}_{\text{(a)}} = \underbrace{\left.\frac{d}{dp} E[Y \mid P = p]\right|_{p = \mu(z)}}_{\text{(b)}},
\]
which confirms that the conventional property extends to the current nonparametric setting, i.e., the structural partial effect can be identified by both (a) the ratio of reduced-form partial effects and (b) the partial effect of the predicted value $P$ of $X$ on $Y$.

This result even extends to the quantile counterpart. Quantile regressions are only statistical objects, and usually do not have structural interpretations, particularly under endogeneity. Theorem 3 (ii), however, shows that the slope of the quantile regression $Q_{Y \mid P}$ on the right-hand side does identify an average of the structural partial effects $\beta(X, E)$ on the left-hand side. Notice that the identifying quantile regression is $Q_{Y \mid P}$, which is in general different from $Q_{Y \mid X}$.

We remark on the analogous expressions in parts (i) and (ii) of Theorem 3. The mean regression $E[Y \mid P]$ is used to identify $E[\beta(X, E) \mid Z]$ in part (i), whereas the quantile regression $Q_{Y \mid P}$ is used to identify $E[\beta(X, E) \mid Y, Z]$ in part (ii).
This parallel is intuitive, but it is not simple enough to be explained concisely with logical precision, because quantile regressions do not in general transform directly into moments. See the proofs in the appendix for the different approaches used to prove the seemingly similar formulas in (i) and (ii).

3. Marginal Treatment Effects

The previous section studied identifiability of the local averages of the structural partial effects $E[\beta(X, E) \mid Z]$ and $E[\beta(X, E) \mid Y, Z]$ by the LIV. However, the LIV is only one particular statistical expression, and need not be considered the sole identifying device. In this section, we ask whether we can extend the idea of the LIV to identify causal effects, $E[\beta(X, E) \mid X, Z]$ and $E[\beta(X, E) \mid Y, X, Z]$, on finer subpopulations characterized by the additional conditioning variable $X$. Using the same idea as Theorem 3 (i) under Assumptions 6 and 8 yields
\begin{align*}
E[\beta(X, E) \mid X = \mu(z) + u, Z = z] &= \left.\frac{d}{dp} E[Y \mid X = p + u, P = p]\right|_{p = \mu(z)} \\
&= \frac{\partial}{\partial x} E[Y \mid X = \mu(z) + u, Z = z] + \frac{\frac{\partial}{\partial z} E[Y \mid X = \mu(z) + u, Z = z]}{\mu'(z)}. \tag{35}
\end{align*}
The identifying statistical object on the right-hand side is no longer the LIV, but the structural partial effect $E[\beta(X, E) \mid X, Z]$ of interest is indeed identified. This section generalizes this heuristic result by replacing Assumptions 6 and 8 with the following assumptions.

Assumption 9 (Invertibility). $h(z, \cdot\,)$ is invertible at each $z$, and $\upsilon(z, \cdot\,)$ denotes the inverse.

Assumption 10 (Local Rank Condition for MTE). $\frac{\partial}{\partial z} \upsilon(z, x) \ne 0$.

Theorem 4 (Mean and Quantile MTE). Suppose that Assumptions 9 and 10 are satisfied for the model (34).⁵ With the notation $\rho(z, x) := \left[\frac{\partial \upsilon(z, x)}{\partial x}\right] \big/ \left[\frac{\partial \upsilon(z, x)}{\partial z}\right]$, the following equalities hold.
\begin{align*}
\text{(i)} \quad & E[\beta(X, E) \mid X = x, Z = z] = \frac{\partial}{\partial x} E[Y \mid X = x, Z = z] - \rho(z, x) \cdot \frac{\partial}{\partial z} E[Y \mid X = x, Z = z] \quad \text{and} \\
\text{(ii)} \quad & E[\beta(X, E) \mid Y = Q_{Y \mid XZ}(\tau \mid x, z), X = x, Z = z] = \frac{\partial}{\partial x} Q_{Y \mid XZ}(\tau \mid x, z) - \rho(z, x) \cdot \frac{\partial}{\partial z} Q_{Y \mid XZ}(\tau \mid x, z).
\end{align*}

Remark 11.
While the model (34) entails the strong instrument independence $Z \perp\!\!\!\perp (E, U)$, only a weaker form of instrument independence, $Z \perp\!\!\!\perp E \mid U$, is required for Theorem 4. In other words, the instrument $Z$ may be correlated with the first-stage unobservable $U$.

Observe the parallel between parts (i) and (ii) of Theorem 4, which is similar to that of Theorem 3. The mean regression $E[Y \mid X, Z]$ is used to identify $E[\beta(X, E) \mid X, Z]$ in part (i), whereas the quantile regression $Q_{Y \mid XZ}$ is used to identify $E[\beta(X, E) \mid Y, X, Z]$ in part (ii). Again, this parallel is not simple enough to be explained concisely, because derivatives of quantile regressions do not transform directly into moments of derivatives.

Theorem 4 (i) establishes identification of $E[\beta(X, E) \mid X, Z]$, but the conditioning variable $X$ is itself an endogenous outcome of the more primitive variables $(Z, U)$. To give a precise economic interpretation to this causal effect, it would be useful to identify causal effects conditional on a set of primitive variables, e.g., $E[\beta(x, E) \mid Z, U]$ instead of $E[\beta(X, E) \mid X, Z]$. Under Assumption 9, the two in fact coincide:
\[
\underbrace{E[\beta(x, E) \mid Z = z, U = \upsilon(z, x)]}_{\text{Primitive Condition}} = \underbrace{E[\beta(X, E) \mid X = x, Z = z]}_{\text{Endogenous Condition}}. \tag{36}
\]

⁵ In addition, we also assume the following regularity conditions: $g$ is continuously differentiable with respect to $x$; $F_{Y \mid XZ}$ is continuously differentiable with respect to $y$; $Q_{Y \mid XZ}$ is continuously differentiable with respect to $(x, z)$; $f_{Y \mid XZ}$ is continuously differentiable with respect to $(x, z)$; and $\beta(x, \cdot)$ is dominated in absolute value by an $L^1(F_{E \mid XZ})$ function.

This shows that the formula in Theorem 4 (i) identifies the causal effects $E[\beta(x, E) \mid Z, U]$, which is close in spirit to Heckman and Vytlacil's (2005) marginal treatment effect (MTE) proposed in the context of the binary treatment model (33). We therefore refer to the causal effects identified in this section as the MTE.
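The formula in Theorem 4 (i) and the equality (36) can be checked by hand in a simple design. The sketch below assumes, hypothetically, $Y = E \cdot X$ (so $\beta(x, E) = E$), a first stage $X = Z + U$ with inverse $\upsilon(z, x) = x - z$, and $E[E \mid U = u, Z = z] = u$, which respects $Z \perp\!\!\!\perp E \mid U$. Then $E[Y \mid X = x, Z = z] = x(x - z)$ and the target is $E[\beta(x, E) \mid Z = z, U = x - z] = x - z$. Derivatives are taken by central finite differences; all function names are illustrative.

```python
def cond_mean_y(x, z):
    """E[Y | X=x, Z=z] in the hypothetical design: U = x - z, so the
    conditional mean of Y = E*X is x * E[E | U = x - z] = x * (x - z)."""
    return x * (x - z)


def inverse_first_stage(x, z):
    """upsilon(z, x): inverse in u of the first stage h(z, u) = z + u."""
    return x - z


def d_dx(f, x, z, step=1e-5):
    """Central finite-difference derivative of f(x, z) with respect to x."""
    return (f(x + step, z) - f(x - step, z)) / (2.0 * step)


def d_dz(f, x, z, step=1e-5):
    """Central finite-difference derivative of f(x, z) with respect to z."""
    return (f(x, z + step) - f(x, z - step)) / (2.0 * step)


def mte_mean(x, z):
    """Right-hand side of Theorem 4 (i)."""
    # rho(z, x) = (d upsilon / dx) / (d upsilon / dz)
    rho = d_dx(inverse_first_stage, x, z) / d_dz(inverse_first_stage, x, z)
    return d_dx(cond_mean_y, x, z) - rho * d_dz(cond_mean_y, x, z)


x0, z0 = 2.0, 0.5
identified = mte_mean(x0, z0)
target = inverse_first_stage(x0, z0)  # E[E | U = x0 - z0] = x0 - z0
print(identified, target)
```

Here $\rho = -1$, so the formula reduces to $\partial_x m + \partial_z m = (2x - z) - x = x - z$, which matches the target without requiring $Z$ and $U$ to be independent.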
Because Theorem 4 requires Assumption 9, identification of the MTE presumes knowledge of the first-stage structure $h$. In many economic applications, the structural construction of an economic model provides an explicit formula for the first-stage function $h$. The following example illustrates the point.

Example 2. Suppose that $Y = g(X, E)$ models the demand for a single good $Y$ as a function of income $X$ and preferences $E$, as in the study of Engel curves. Income $X$ can be taken to be within-period total expenditure, which is justified under preference restrictions (Lewbel, 1999). The first-stage endogenous choice $X = h(Z, U)$ is modeled as the result of optimizing behavior. Suppose that the dynamic consumption decision of an economic agent at time $t$ is given by
\[
\max_{\{X_{t+\tau}\}_{\tau=0}^{\bar{T}-t}} E_t\left[\sum_{\tau=0}^{\bar{T}-t} \beta^\tau u(X_{t+\tau}; \theta)\right] \quad \text{s.t.} \quad M_{t+1} = (M_t - X_t) R + Z_{t+1},
\]
where $u(\cdot\,; \theta)$ is the CARA utility function with parameter $\theta$, $Z_t$ is the consumer's idiosyncratic labor income, $M_t$ denotes assets, and $R$ is the interest factor, which is fixed and deterministic for simplicity. If the growth of $Z_t$ is stochastic with Gaussian iid innovations with variance $\sigma^2$, then the first-stage function for individuals with no initial assets is given by
\[
X_t = Z_t - \frac{\sigma^2 \theta}{2\beta},
\]
where individuals have heterogeneous structural parameters $(\beta, \theta, \sigma^2)$. Applying Theorem 4 (i) together with (36) yields identification of the local average marginal Engel coefficient
\[
E[\beta(x, E) \mid Z = z, \sigma^2\theta/\beta = 2(z - x)] = \frac{\partial}{\partial x} E[Y \mid X = x, Z = z] + \frac{\partial}{\partial z} E[Y \mid X = x, Z = z].
\]
The object identified in this equation has a clear economic interpretation: the average of heterogeneous marginal Engel coefficients at $X = x$ among the subpopulation of consumers earning $z$ units of income with volatility $\sigma^2$ and having preference parameters $(\beta, \theta)$ that satisfy the relation $\sigma^2\theta/\beta = 2(z - x)$.

Note that, in this example, the first-stage structure is trivially identified to be $X = Z + U$ where $U := -0.5\sigma^2\theta/\beta$.
This trivial identification of the first stage holds even without any statistical or mean independence conditions between $Z$ and $U$. In other words, instrument independence for the first-stage unobservables need not hold for the purpose of identifying the structural causal effects (see Remark 11).

4. Two Special Cases of the MTE

Example 2 illustrated a clear economic interpretation of the identified causal effects (MTE) when the first stage is structurally constructed. Many applications, however, lack such structural motivations. In the absence of structural models, economists have often substituted statistical objects such as mean regressions and quantile regressions. In this section, we demonstrate that the MTE is also compatible with such statistical devices, though its economic interpretation becomes less clear than in the case of Example 2. Sections 4.1 and 4.2 present the special cases of the MTE when the first stage is abstractly summarized by a mean regression and a quantile regression, respectively.

4.1. Case 1: When First Stage Is a Mean Regression. Suppose that the first-stage model is a nonparametric mean regression of the form
\[
h(z, u) = \mu(z) + u \quad \text{where } E[U \mid Z] = 0.
\]
In this case, Assumption 9 is trivially satisfied. Furthermore, the traditional local rank condition $\mu'(z) \ne 0$ satisfies Assumption 10, and Theorem 4 can therefore be used.

Note that Remark 11 following Theorem 4 states that statistical independence between the instrument $Z$ and the first-stage unobservable $U$ is not required. Therefore, heteroscedasticity in the first stage, in which $E[U^2 \mid Z]$ varies with $Z$, is admissible in particular. Applying the theorem to this special case entails $\rho(z, x) = -1/[\mu'(z)]$ and yields the following result.

Corollary 3 (MTE When First Stage is a Mean Regression). Suppose that the model is given by
\[
\begin{cases}
Y = g(X, E) & \text{where } Z \perp\!\!\!\perp E \mid U \\
X = \mu(Z) + U & \text{where } E[U \mid Z] = 0 \text{ and } \mu'(z) \ne 0.
\end{cases}
\]
Then, the following identifying equalities hold.
\begin{align*}
\text{(i)} \quad & E[\beta(X, E) \mid X = x, Z = z] = \frac{\partial}{\partial x} E[Y \mid X = x, Z = z] + \frac{\frac{\partial}{\partial z} E[Y \mid X = x, Z = z]}{\mu'(z)} \quad \text{and} \\
\text{(ii)} \quad & E[\beta(X, E) \mid Y = Q_{Y \mid XZ}(\tau \mid x, z), X = x, Z = z] = \frac{\partial}{\partial x} Q_{Y \mid XZ}(\tau \mid x, z) + \frac{\frac{\partial}{\partial z} Q_{Y \mid XZ}(\tau \mid x, z)}{\mu'(z)}.
\end{align*}
Not surprisingly, Corollary 3 (i) corresponds to (35), by which the MTE was motivated as a natural extension of the LIV under separable first-stage models.

4.2. Case 2: When First Stage Is a Quantile Regression. Another important special case is when the first-stage model is represented by a nonparametric quantile regression of the form
\[
h(z, u) = Q_{X \mid Z}(u \mid z) \quad \text{where } Z \perp\!\!\!\perp U.
\]
In this case, Assumption 9 is satisfied if $X \mid Z = z$ is continuously distributed. Furthermore, the traditional local rank condition $\frac{\partial}{\partial z} F_{X \mid Z}(x \mid z) \ne 0$ satisfies Assumption 10. Applying Theorem 4 to this special case entails $\rho(z, x) = -\left[\frac{\partial}{\partial z} Q_{X \mid Z}(u \mid z)\right]^{-1}$ where $u = F_{X \mid Z}(x \mid z)$, and therefore yields the following result.

Corollary 4 (MTE When First Stage is a Quantile Regression). Suppose that the model is given by
\[
\begin{cases}
Y = g(X, E) \\
X = Q_{X \mid Z}(U \mid Z)
\end{cases}
\qquad \text{where } Z \perp\!\!\!\perp (E, U).
\]
Then, the following identifying equalities hold.
\begin{align*}
\text{(i)} \quad & E[\beta(X, E) \mid X = x, Z = z] = \frac{\partial}{\partial x} E[Y \mid X = x, Z = z] + \frac{\frac{\partial}{\partial z} E[Y \mid X = x, Z = z]}{\frac{\partial}{\partial z} Q_{X \mid Z}(u \mid z)} \quad \text{and} \\
\text{(ii)} \quad & E[\beta(X, E) \mid Y = Q_{Y \mid XZ}(\tau \mid x, z), X = x, Z = z] = \frac{\partial}{\partial x} Q_{Y \mid XZ}(\tau \mid x, z) + \frac{\frac{\partial}{\partial z} Q_{Y \mid XZ}(\tau \mid x, z)}{\frac{\partial}{\partial z} Q_{X \mid Z}(u \mid z)},
\end{align*}
where $u := F_{X \mid Z}(x \mid z)$.

The right-hand side of Corollary 4 (ii) turns out to be the same as that of the identifying equality
\[
\frac{\partial}{\partial x} Q_{Y \mid XU}(y \mid x, u) = \frac{\partial}{\partial x} Q_{Y \mid XZ}(\tau \mid x, z) + \frac{\frac{\partial}{\partial z} Q_{Y \mid XZ}(\tau \mid x, z)}{\frac{\partial}{\partial z} Q_{X \mid Z}(u \mid z)} \tag{37}
\]
derived by Imbens and Newey (2009; Theorem 2). The left-hand side expressions, however, are different. Equation (37) identifies the statistical object $\frac{\partial}{\partial x} Q_{Y \mid XU}(y \mid x, u)$, whereas Corollary 4 (ii) identifies a local average of the structural object $\beta(X, E)$. Quantile partial effects do not generally represent structural partial effects unless rank invariance is assumed.
Our result therefore adds a structural interpretation to (37), and parallels Chesher (2003), who originally derived this formula to identify non-averaged structural partial effects.

The identifying equalities in Corollary 4 (i) and (ii) continue to hold whenever the first-stage function $h$ is monotone with respect to the unobservables $u$, because a quantile regression can be used to represent a monotone first stage. Monotonicity, however, is also the exact limit up to which they continue to hold. In other words, the following assumption is a necessary and sufficient condition for these identifying equalities.

Assumption 11 (Monotonicity). There exist functions $\bar{h} : \mathbb{R}^2 \to \mathbb{R}$ and $\iota : \mathbb{R}^M \to \mathbb{R}$ such that $h(z, u) = \bar{h}(z, \iota(u))$ and $\bar{h}$ is strictly monotone in its second argument.

Proposition 3 (Necessary and Sufficient Condition). Suppose that $F_{X|Z}(\cdot \mid z)$ is strictly increasing in the model (34).[6] Then the identifying equalities in Corollary 4 (i) and (ii) hold for all structural models $(g, F_{E|U})$ if and only if Assumption 11 holds for the first-stage function $h$.

[6] In addition, we also assume the following regularity conditions: $g$ is continuously differentiable with respect to $x$; $h$ is continuously differentiable; $F_{X|Z}$ is absolutely continuous; $Q_{X|Z}$ is continuously differentiable with respect to $z$; and $f_{U|XZ}$ is continuously differentiable with respect to $(x, z)$.

[Figure 2.1. The role of monotonicity in related identification results. Logical diagram with four nodes: (a) Monotonicity; (b) Control Variable; (c) Identification of Quantile Partial Effect; (d) Identification of Structural Partial Effect. Nodes (a) and (b) are linked by Imbens (2007) and Imbens and Newey (2009); (b) leads to (c) by Imbens and Newey (2009); (a) and (d) are linked by Proposition 3; (c) and (d) share the same Chesher formula on the right-hand side.]
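As a sanity check on the identifying equalities in Corollary 3 (i) and Corollary 4 (i), the following Python sketch evaluates both right-hand sides by finite differences in a toy model and compares them with the known structural effect. Everything in the toy model is an assumption made for illustration and does not come from the text: the outcome function $g(x, \varepsilon) = \varepsilon x^2$ (so that $\beta(x, \varepsilon) = 2\varepsilon x$), the relation $E[E \mid U = u] = u$, the mean first stage $\mu(z) = z + z^2/2$, and the quantile first stage $Q_{X|Z}(u \mid z) = z + u(1 + z)$ with $U$ uniform on $(0, 1)$.

```python
# Toy check of Corollary 3 (i) and Corollary 4 (i) using closed-form
# conditional means and central finite differences.
# ASSUMPTIONS (hypothetical, for illustration only):
#   g(x, e) = e * x**2, hence beta(x, e) = 2 * e * x
#   E[E | U = u] = u (for example, E = U + noise independent of (Z, U))
# Under these, E[Y | X = x, Z = z] = u(x, z) * x**2 and
# E[beta | X = x, Z = z] = 2 * x * u(x, z), where u(x, z) is the value of
# the first-stage unobservable implied by (x, z).

def d_dx(f, x, z, eps=1e-5):
    """Central finite difference of f(x, z) with respect to x."""
    return (f(x + eps, z) - f(x - eps, z)) / (2 * eps)

def d_dz(f, x, z, eps=1e-5):
    """Central finite difference of f(x, z) with respect to z."""
    return (f(x, z + eps) - f(x, z - eps)) / (2 * eps)

def true_effect(x, u):
    """E[beta(X, E) | X = x, U = u] in the toy model."""
    return 2 * x * u

# Case 1: mean-regression first stage, X = mu(Z) + U.
def mu(z):
    return z + 0.5 * z ** 2

def mu_prime(z):
    return 1 + z

def u_mean(x, z):
    return x - mu(z)

def ey_mean(x, z):
    return u_mean(x, z) * x ** 2

def corollary3_rhs(x, z):
    """d/dx E[Y|x,z] + (d/dz E[Y|x,z]) / mu'(z)."""
    return d_dx(ey_mean, x, z) + d_dz(ey_mean, x, z) / mu_prime(z)

# Case 2: quantile-regression first stage, X = Q(U | Z) = Z + U * (1 + Z).
def u_quant(x, z):
    return (x - z) / (1 + z)  # u = F_{X|Z}(x | z)

def ey_quant(x, z):
    return u_quant(x, z) * x ** 2

def dq_dz(u, z):
    return 1 + u  # d/dz of z + u * (1 + z)

def corollary4_rhs(x, z):
    """d/dx E[Y|x,z] + (d/dz E[Y|x,z]) / (d/dz Q_{X|Z}(u|z))."""
    u = u_quant(x, z)
    return d_dx(ey_quant, x, z) + d_dz(ey_quant, x, z) / dq_dz(u, z)

if __name__ == "__main__":
    x, z = 1.5, 0.5
    print(corollary3_rhs(x, z), true_effect(x, u_mean(x, z)))
    print(corollary4_rhs(x, z), true_effect(x, u_quant(x, z)))
```

In both cases the term involving the derivative with respect to $z$ removes the part of $\frac{\partial}{\partial x} E[Y \mid X = x, Z = z]$ that arises only because the implied value of $U$ changes with $x$; this is the role played by $\rho(z, x)$ in Theorem 4.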
Imbens and Newey (2009) show that monotonicity is sufficient for a control variable,[7] which in turn is sufficient for the identifying equality (37). We show that monotonicity is necessary and sufficient for the identifying equality in Corollary 4 (ii). These relations are summarized in the logical diagram in Figure 2.1.

[7] Imbens (2007) discusses its necessity. This is followed up by Kasy (2011).

5. Nonlinear Heterogeneous Effects of Smoking

Adverse effects of smoking during pregnancy on infant birth weights have been extensively studied in the health economics literature (e.g., Rosenzweig and Schultz (1983); Evans and Ringel (1999); Lien and Evans (2005)). Most papers, including those in the medical literature, have suggested that the effect of smoking (as a binary variable) on infant birth weight ranges from −200 grams to −400 grams. Given that the average number of cigarettes smoked by pregnant smokers was 12 between 1989 and 1999, the implied average effect of one cigarette on infant birth weight ranges from about −17 grams (−200/12) to about −33 grams (−400/12). The goal of our analysis is to provide a much more detailed assessment of the effect of smoking; in particular, we consider the heterogeneous marginal effects of a single cigarette as opposed to these coarse average effects.

Specifically, we analyze the effects of the number of cigarettes on infant birth weight, extending an older idea of Evans and Ringel (1999). We allow for arbitrary nonlinear, endogenous and heterogeneous effects of smoking, and want to obtain averages of causal marginal effects for various subpopulations defined by treatment intensity, as well as by other variables that proxy for unobserved heterogeneity as detailed below. Evans and Ringel use the cigarette excise tax rate as a source of exogenous variation to mitigate confounding factors in identifying the effects of smoking.
We follow this idea; in our framework tax rates hence play the role of $Z$, while the number of cigarettes per day and infant birth weight are $X$ and $Y$, respectively. The causal model is then given by
\[
\begin{cases}
Y = g(X, S, E) \\
X = h(Z, S, U)
\end{cases}
\]
where $E$ captures other unobserved factors related to the lifestyle of the mother that affect the child's birth weight. Other observed characteristics of the mother, denoted $S$, are also controlled for, including maternal age, alcohol intake, number of prenatal visits, and number of live births experienced.

We use a cross section of the natality data from the Natality Vital Statistics System of the National Center for Health Statistics. The main variables in the data are summarized in Table 2.1. From this data set we extract a random sample of size 100,000 from the period 1989 to 1999.

Table 2.1. Descriptive statistics of the data.

Variable     | Mean | Std. Dev. | Description
Birth Weight | 3330 | 606       | Infant birth weight measured in grams
Cigarette    | 1.75 | 5.51      | Number of cigarettes smoked per day
Tax          | 30.4 | 15.5      | Excise tax rate on cigarettes in percentage
Age          | 26.7 | 6.0       | Maternal age
Drinks       | 0.04 | 0.75      | Number of times of drinking per week
Visits       | 11.3 | 4.1       | Number of prenatal care visits
Births       | 1.97 | 1.00      | Number of live births experienced

The structural features of interest are the average marginal effects of a cigarette, $\beta(X, S, E)$, for subpopulations defined by combinations of $Y$, $X$ and $Z$. Such causal effects are identified in Theorems 2–4 and Corollaries 3 and 4. Note that all these identifying equalities are expressed in terms of derivatives of nonparametric mean and/or quantile regressions, whose estimation and large sample theories are standard in the econometric literature (see Fan and Gijbels, 1996). We do not elaborate on these standard statistical results in this paper.

[Figure 2.2. Confidence intervals of $E[\beta(X, S, E) \mid Y = 2500, X = x, Z = 0.30, S = \bar{s}]$.]

Estimates of the structural partial effects
$\beta(X, S, E)$ projected on $(Y, X, Z)$ are plotted in Figures 2.2–2.7. Due to the point mass of the distribution of $X$ at $X = 0$, which conflicts with the assumption of absolute continuity, our analysis focuses on the domain away from this point. With this framework in place, we make the following observations:

1. Comparing the graphs with lower $Z$ (e.g., Figure 2.2) and higher $Z$ (e.g., Figure 2.6), we observe ceteris paribus a great deal of heterogeneity in overall effects. In particular, the marginal effects under higher tax rates are relatively larger in magnitude. In other words, pregnant women who still choose to smoke despite facing higher tax rates exhibit larger marginal effects of smoking on infant birth weights. We will discuss this phenomenon in more detail below.

[Figure 2.3. Confidence intervals of $E[\beta(X, S, E) \mid Y = 3000, X = x, Z = 0.30, S = \bar{s}]$.]
[Figure 2.4. Confidence intervals of $E[\beta(X, S, E) \mid Y = 2500, X = x, Z = 0.40, S = \bar{s}]$.]
[Figure 2.5. Confidence intervals of $E[\beta(X, S, E) \mid Y = 3000, X = x, Z = 0.40, S = \bar{s}]$.]
[Figure 2.6. Confidence intervals of $E[\beta(X, S, E) \mid Y = 2500, X = x, Z = 0.50, S = \bar{s}]$.]

2. Comparing the marginal effects across $X$, we observe a common tendency for marginal effects to diminish towards $x = 20$ (e.g., Figures 2.3–2.7). That is, the negatively sloped structural function $g$ will eventually flatten on average as $x$ increases. This phenomenon reflects the reduction in harm of an additional cigarette as the number of cigarettes increases, i.e., diminishing marginal effects. It is imperative to keep in mind, however, that a woman who smoked 20 cigarettes a day has already inflicted a large cumulative effect on her child.

[Figure 2.7. Confidence intervals of $E[\beta(X, S, E) \mid Y = 3000, X = x, Z = 0.50, S = \bar{s}]$.]

3. Comparing the graphs with different values of $Y$ (e.g., Figures 2.2 and 2.3), we observe some differences in marginal effects across quantiles of $Y$, especially at lower tax rates $z = 30$.
Marginal effects of smoking on birth weights tend to be smaller for lower quantiles of $Y$. This makes sense, as it is more difficult to reduce a birth weight that is already low by the same absolute value (though a similar percentage reduction seems conceivable). These quantile differences are milder at higher tax rates $z > 40$. However, the differences in $Y$ are not pronounced in this application, and as a consequence it may be justified to focus on the difference across $(x, z)$ by integrating out $Y$. These effects are illustrated in Figures 2.8–2.10, and they reinforce the observations made in the first two points above.

[Figure 2.8. Confidence intervals of $E[\beta(X, S, E) \mid X = x, Z = 0.30, S = \bar{s}]$.]
[Figure 2.9. Confidence intervals of $E[\beta(X, S, E) \mid X = x, Z = 0.40, S = \bar{s}]$.]
[Figure 2.10. Confidence intervals of $E[\beta(X, S, E) \mid X = x, Z = 0.50, S = \bar{s}]$.]

It is instructive to examine the first point in more detail and provide likely causal explanations. As the graphs indicate, the partial effects, which are negative, tend to become more negative as $Z$ increases for each fixed value of $X$. Suppose now that $z' > z$. The subpopulation who smokes $x$ cigarettes when the taxes are $z'$ is then characterized by a higher preference for smoking than the subpopulation that smokes $x$ cigarettes at the lower price (tax) $z$. What causes endogeneity is now precisely the correlation between this preference for smoking and other factors in $E$, in particular adverse ones, say, a preference for an unhealthy lifestyle, and/or a partner who also smokes. The graphical results imply that the magnitude of partial effects tends to be positively related to higher taxes in excess of the effect already incurred through $X$, suggesting this revealed preference for a negative lifestyle as an explanation. Moreover, it implies that the magnitude of partial effects tends to be positively related to unhealthy factors in $E$, other things fixed. Lastly, this implies the cross partial sign
\[
0 < \frac{\partial}{\partial \varepsilon} \left| \frac{\partial}{\partial x} g(x, \varepsilon) \right| = -\frac{\partial^2}{\partial \varepsilon \, \partial x} g(x, \varepsilon).
\]
Therefore, smoking $X$ and other unhealthy behavioral inputs $E$ are likely to be complementary negative inputs in the birth weight “production” function $g$. So, based on our results, policy should discourage not just smoking, but also the unhealthy lifestyle associated with it that exacerbates its effect.

6. Summary

Economists are generally interested in estimating structural objects such as the structural partial effects $\beta(X, E)$. When $E$ is multi-dimensional and/or the structural function $g$ is not invertible with respect to $E$, quantile partial effects generally do not represent these structural partial effects of interest. This paper explored possibilities of identifying local means of the structural partial effects. The main results are summarized in Table 2.2.

We first considered the LIV as a natural extension of the classical two-stage least squares. This method identifies a useful policy parameter, but at the cost of a separable first-stage assumption, which is necessary as well as sufficient (Theorem 2). The classical idea of “regressing” $Y$ on the predicted values of $X$ continues to work in identifying $E[\beta(X, E) \mid Z]$, even if $g$ is nonparametric and nonseparable (Theorem 3 (i)). Moreover, this idea extends to identification of $E[\beta(X, E) \mid Y, Z]$ simply by replacing the mean regression by the corresponding quantile regression (Theorem 3 (ii)).

We next considered identifying the structural partial effects controlling for the endogenous choice variable $X$. Equation (35) heuristically demonstrated that such a causal effect can be identified as a result of a slight extension of the LIV concept, given prior knowledge of what the first-stage model looks like. This heuristic finding was generalized in Theorem 4 (i), showing identification of $E[\beta(X, E) \mid X, Z]$. Replacing the mean regressions by the corresponding quantile regressions allowed identification of $E[\beta(X, E) \mid Y, X, Z]$ (Theorem 4 (ii)).
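This parallel between parts (i) and (ii) of Theorem 4 resembles that of Theorem 3.

These structural partial effects (MTE) have clear economic interpretations when the first-stage model is structurally constructed, as demonstrated in Example 2. In the absence of structural motivations in the first stage, we can still substitute statistical models such as the mean regression or the quantile regression to represent a first-stage model (Corollaries 3 and 4). The identifying formula in the special case of using quantile regressions to represent the first-stage model (Corollary 4 (ii)) coincides with the well-known formula previously proposed by Chesher (2003) and Imbens and Newey (2009). The identified objects, however, are different from each other.

Using the ability of the identified objects to describe heterogeneous structural partial effects, we studied causal effects of smoking on infant birth weights. Smoking is significantly harmful, with diminishing marginal effects. Moreover, these marginal effects tend to be greater for those mothers smoking under higher cigarette excise tax rates. Our inspection of this result implies that smoking and its associated unhealthy lifestyle have complementary negative effects on infant birth weights.

Table 2.2. Summary of identified structural parameters and the respective first-stage models.

Group | Result | Identified structural object | Identifying statistical object | First-stage structural/statistical model
LIV | Theorem 2 | $E[\beta(X,E) \mid Z]$ | Local two-stage least squares: $\frac{d}{dz} E[Y \mid Z] \big/ \frac{d}{dz} E[X \mid Z]$ | Stochastically separable structure: $\mathrm{Cov}\left(\beta(h(z,U),E), \frac{\partial}{\partial z} h(z,U)\right) = 0$
LIV | Theorem 3 (i) | $E[\beta(X,E) \mid Z]$ | $\frac{d}{dp} E[Y \mid P]$ where $P := \mu(Z)$ | Separable structure: $X = \mu(Z) + U$, $Z \perp\!\!\!\perp U$
LIV | Theorem 3 (ii) | $E[\beta(X,E) \mid Y, Z]$ | $\frac{\partial}{\partial p} Q(Y \mid P)$ where $P := \mu(Z)$ | Separable structure: $X = \mu(Z) + U$, $Z \perp\!\!\!\perp U$
MTE | Theorem 4 (i) | $E[\beta(X,E) \mid X, Z]$ | $\frac{\partial}{\partial x} E[Y \mid X,Z] - \frac{\partial}{\partial z} E[Y \mid X,Z] \cdot \rho(Z,X)$ | General identifiable structure (see Example 2): $X = h(Z,U)$; $Z$ and $U$ may be correlated
MTE | Theorem 4 (ii) | $E[\beta(X,E) \mid Y, X, Z]$ | $\frac{\partial}{\partial x} Q(Y \mid X,Z) - \frac{\partial}{\partial z} Q(Y \mid X,Z) \cdot \rho(Z,X)$ | General identifiable structure (see Example 2): $X = h(Z,U)$; $Z$ and $U$ may be correlated
MTE | Corollary 3 (i) | $E[\beta(X,E) \mid X, Z]$ | $\frac{\partial}{\partial x} E[Y \mid X,Z] + \frac{\partial}{\partial z} E[Y \mid X,Z] \big/ \frac{d}{dz} E[X \mid Z]$ | Mean regression: $X = \mu(Z) + U$, $E[U \mid Z] = 0$
MTE | Corollary 3 (ii) | $E[\beta(X,E) \mid Y, X, Z]$ | $\frac{\partial}{\partial x} Q(Y \mid X,Z) + \frac{\partial}{\partial z} Q(Y \mid X,Z) \big/ \frac{d}{dz} E[X \mid Z]$ | Mean regression: $X = \mu(Z) + U$, $E[U \mid Z] = 0$
MTE | Corollary 4 (i) | $E[\beta(X,E) \mid X, Z]$ | $\frac{\partial}{\partial x} E[Y \mid X,Z] + \frac{\partial}{\partial z} E[Y \mid X,Z] \big/ \frac{\partial}{\partial z} Q(X \mid Z)$ | Quantile regression: $X = Q_{X|Z}(U \mid Z)$, $Z \perp\!\!\!\perp U$
MTE | Corollary 4 (ii) | $E[\beta(X,E) \mid Y, X, Z]$ | $\frac{\partial}{\partial x} Q(Y \mid X,Z) + \frac{\partial}{\partial z} Q(Y \mid X,Z) \big/ \frac{\partial}{\partial z} Q(X \mid Z)$ | Quantile regression: $X = Q_{X|Z}(U \mid Z)$, $Z \perp\!\!\!\perp U$

7. Mathematical Appendix

7.1. Proof of Theorem 2.

Proof. Using the definition (34) of the structural model, we have
\[
E[Y \mid Z = z] = \iint g(h(z,u), \varepsilon) f_{EU|Z}(\varepsilon, u \mid z) \, d\varepsilon \, du = \iint g(h(z,u), \varepsilon) f_{EU}(\varepsilon, u) \, d\varepsilon \, du
\]
where the last equality is due to the instrument independence in (34). Taking derivatives on both sides produces
\[
\begin{aligned}
\frac{d}{dz} E[Y \mid Z = z]
&= \iint \beta(h(z,u), \varepsilon) \left[ \frac{\partial}{\partial z} h(z,u) \right] f_{EU}(\varepsilon, u) \, d\varepsilon \, du \\
&= \iint \beta(h(z,u), \varepsilon) \left[ \frac{\partial}{\partial z} h(z,u) \right] f_{EU|Z}(\varepsilon, u \mid z) \, d\varepsilon \, du \\
&= E\left[ \beta(X,E) \cdot \frac{\partial}{\partial z} h(Z,U) \,\Big|\, Z = z \right] \\
&= E[\beta(X,E) \mid Z = z] \cdot E\left[ \frac{\partial}{\partial z} h(Z,U) \,\Big|\, Z = z \right] + \mathrm{Cov}\left( \beta(X,E), \frac{\partial}{\partial z} h(Z,U) \,\Big|\, Z = z \right)
\end{aligned}
\]
where the first equality is due to the differentiability of $g$ and $h$ with respect to their first arguments as well as the $L^1$ dominance of the integrand, and the second equality is again due to the instrument independence in (34). The instrument independence also yields
\[
E\left[ \frac{\partial}{\partial z} h(Z,U) \,\Big|\, Z = z \right] = \frac{d}{dz} E[X \mid Z = z]
\quad\text{and}\quad
\mathrm{Cov}\left( \beta(X,E), \frac{\partial}{\partial z} h(Z,U) \,\Big|\, Z = z \right) = \mathrm{Cov}\left( \beta(h(z,U),E), \frac{\partial}{\partial z} h(z,U) \right).
\]
Substituting these equalities and rearranging terms under Assumption 6, we obtain
\[
E[\beta(X,E) \mid Z = z] = \frac{\frac{d}{dz} E[Y \mid Z = z] - \mathrm{Cov}\left( \beta(h(z,U),E), \frac{\partial}{\partial z} h(z,U) \right)}{\frac{d}{dz} E[X \mid Z = z]}.
\]
Therefore, the desired equality holds if and only if Assumption 7 is true. □

7.2. Proof of Theorem 3.

Proof.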
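(i) Using Assumption 8, we can write
\[
\begin{aligned}
\frac{d}{dp} E[Y \mid P = p] \Big|_{p = \mu(z)}
&= \frac{d}{dp} \iint g(p + u, \varepsilon) f_{EU|P}(\varepsilon, u \mid p) \, d\varepsilon \, du \, \Big|_{p = \mu(z)} \\
&= \frac{d}{dp} \iint g(p + u, \varepsilon) f_{EU}(\varepsilon, u) \, d\varepsilon \, du \, \Big|_{p = \mu(z)} \\
&= \iint \beta(\mu(z) + u, \varepsilon) f_{EU}(\varepsilon, u) \, d\varepsilon \, du \\
&= \iint \beta(\mu(z) + u, \varepsilon) f_{EU|Z}(\varepsilon, u \mid z) \, d\varepsilon \, du = E[\beta(X,E) \mid Z = z],
\end{aligned}
\]
where the second and fourth equalities are due to the instrumental independence in the model (34), and the third equality is due to the differentiability of $g$ with respect to $x$ as well as the $L^1$-domination of $\beta$.

(ii) We derive the following three auxiliary equations. First,
\[
\begin{aligned}
&\Pr[g(X,E) \le Q_{Y|P}(\tau \mid p + \delta) \mid P = p + \delta] - \Pr[g(X,E) \le Q_{Y|P}(\tau \mid p) \mid P = p + \delta] \\
&\quad = F_{Y|P}(Q_{Y|P}(\tau \mid p + \delta) \mid p + \delta) - F_{Y|P}(Q_{Y|P}(\tau \mid p) \mid p + \delta) \\
&\quad = \delta \left[ \frac{\partial}{\partial p} Q_{Y|P}(\tau \mid p) \right] f_{Y|P}(Q_{Y|P}(\tau \mid p) \mid p + \delta) + o(\delta)
\end{aligned}
\tag{38}
\]
holds under the differentiability of $F_{Y|P}$ and $Q_{Y|P}$ with respect to $y$ and $p$, respectively. Second,
\[
\begin{aligned}
&\Pr[g(X,E) \le Q_{Y|P}(\tau \mid p) \mid P = p + \delta] - \Pr[g(X + \delta, E) \le Q_{Y|P}(\tau \mid p) \mid P = p] \\
&\quad = \Pr[g(p + \delta + U, E) \le Q_{Y|P}(\tau \mid p) \mid P = p + \delta] - \Pr[g(p + \delta + U, E) \le Q_{Y|P}(\tau \mid p) \mid P = p] \\
&\quad = \Pr[g(p + \delta + U, E) \le Q_{Y|P}(\tau \mid p)] - \Pr[g(p + \delta + U, E) \le Q_{Y|P}(\tau \mid p)] = 0,
\end{aligned}
\tag{39}
\]
where the second equality is due to the instrumental independence in the model (34).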
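Third, using the short-hand notation $B = \beta(X,E)$, we have
\[
\begin{aligned}
&\Pr[g(X + \delta, E) \le Q_{Y|P}(\tau \mid p) \mid P = p] - \Pr[g(X,E) \le Q_{Y|P}(\tau \mid p) \mid P = p] \\
&\quad = \Pr[Q_{Y|P}(\tau \mid p) < Y \le Q_{Y|P}(\tau \mid p) - (g(X + \delta, E) - Y) \mid P = p] \\
&\qquad - \Pr[Q_{Y|P}(\tau \mid p) - (g(X + \delta, E) - Y) < Y \le Q_{Y|P}(\tau \mid p) \mid P = p] \\
&\quad = \Pr[Q_{Y|P}(\tau \mid p) \le Y \le Q_{Y|P}(\tau \mid p) - \delta B \mid P = p] \\
&\qquad - \Pr[Q_{Y|P}(\tau \mid p) - \delta B \le Y \le Q_{Y|P}(\tau \mid p) \mid P = p] + o(\delta) \\
&\quad = \int_{Q_{Y|P}(\tau \mid p)}^{\infty} \int_{-\infty}^{-\delta^{-1}[y - Q_{Y|P}(\tau \mid p)]} f_{YB|P}(y, b \mid p) \, db \, dy \\
&\qquad - \int_{-\infty}^{Q_{Y|P}(\tau \mid p)} \int_{-\delta^{-1}[y - Q_{Y|P}(\tau \mid p)]}^{\infty} f_{YB|P}(y, b \mid p) \, db \, dy + o(\delta) \\
&\quad = -\delta \int_{-\infty}^{0} b \, f_{YB|P}(Q_{Y|P}(\tau \mid p), b \mid p) \, db - \delta \int_{0}^{\infty} b \, f_{YB|P}(Q_{Y|P}(\tau \mid p), b \mid p) \, db + o(\delta) \\
&\quad = -\delta \, E[B \mid Y = Q_{Y|P}(\tau \mid p), P = p] \cdot f_{Y|P}(Q_{Y|P}(\tau \mid p) \mid p) + o(\delta),
\end{aligned}
\tag{40}
\]
where the second equality is due to the differentiability of $g$ and $F_{Y|P}$ with respect to $x$ and $y$, respectively, and the fourth equality is due to a change of variables and integration by parts. Adding the three equations (38), (39) and (40) together, the left-hand sides telescope to zero and we get
\[
0 = \delta \left[ \frac{\partial}{\partial p} Q_{Y|P}(\tau \mid p) \right] \cdot f_{Y|P}(Q_{Y|P}(\tau \mid p) \mid p + \delta) - \delta \, E[B \mid Y = Q_{Y|P}(\tau \mid p), P = p] \cdot f_{Y|P}(Q_{Y|P}(\tau \mid p) \mid p) + o(\delta).
\]
Under the condition that $f_{Y|P}$ is continuous in $p$, dividing by $\delta$ and letting $\delta \to 0$ yields
\[
E[\beta(X,E) \mid Y = Q_{Y|P}(\tau \mid p), P = p] = \frac{\partial}{\partial p} Q_{Y|P}(\tau \mid p).
\]
Since $Z = z$ and $P = \mu(z)$ are the same events under Assumption 6, setting $p = \mu(z)$ yields the result. □

7.3. Proof of Theorem 4.

Proof. (i) We derive the following two auxiliary equations. First,
\[
\begin{aligned}
\frac{\partial}{\partial x} E[Y \mid X = x, Z = z]
&= \frac{\partial}{\partial x} \int g(x, \varepsilon) f_{E|XZ}(\varepsilon \mid x, z) \, d\varepsilon
= \frac{\partial}{\partial x} \int g(x, \varepsilon) f_{E|UZ}(\varepsilon \mid \upsilon(z,x), z) \, d\varepsilon \\
&= E[\beta(X,E) \mid U = \upsilon(z,x), Z = z] \\
&\quad + \frac{\partial}{\partial x} \upsilon(z,x) \cdot E\left[ g(X,E) \frac{\partial}{\partial u} \log f_{E|UZ}(E \mid U, Z) \,\Big|\, U = \upsilon(z,x), Z = z \right],
\end{aligned}
\]
where the second equality is due to Assumption 9 and the third equality is due to the differentiability of $g$ with respect to $x$ as well as the $L^1$ dominance of the integrand.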
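Second, a similar calculation yields
\[
\frac{\partial}{\partial z} E[Y \mid X = x, Z = z] = \frac{\partial}{\partial z} \upsilon(z,x) \cdot E\left[ g(X,E) \frac{\partial}{\partial u} \log f_{E|UZ}(E \mid U, Z) \,\Big|\, U = \upsilon(z,x), Z = z \right],
\]
where the instrumental independence in model (34) was used to eliminate $\frac{\partial}{\partial z} f_{E|UZ}$. Combining the above two equations and rearranging by Assumption 10 yields the desired result.

(ii) Assumptions 9 and 10 provide the parameterized curve $h \mapsto (h, \delta_z(h)) =: (\delta_x, \delta_z)$ that solves the implicit function equation $\upsilon(z + \delta_z, x + \delta_x) - \upsilon(z, x) = 0$ of a smooth submanifold in a neighborhood of $h = 0$. Furthermore, $\delta_z(0) = 0$ and $(\delta_x, \delta_z) \to 0$ as $h \to 0$. By these properties, we have
\[
\frac{\delta_z}{\delta_x} = -\frac{\frac{\partial}{\partial x} \upsilon(z,x)}{\frac{\partial}{\partial z} \upsilon(z,x)} + o(1) \quad \text{as } h \to 0. \tag{41}
\]
Next, we derive the following four auxiliary equations. First,
\[
\begin{aligned}
&\Pr[g(x + \delta_x, E) \le Q_{Y|XZ}(\tau \mid x + \delta_x, z + \delta_z) \mid X = x + \delta_x, Z = z + \delta_z] \\
&\qquad - \Pr[g(x + \delta_x, E) \le Q_{Y|XZ}(\tau \mid x, z + \delta_z) \mid X = x + \delta_x, Z = z + \delta_z] \\
&\quad = F_{Y|XZ}(Q_{Y|XZ}(\tau \mid x + \delta_x, z + \delta_z) \mid x + \delta_x, z + \delta_z) - F_{Y|XZ}(Q_{Y|XZ}(\tau \mid x, z + \delta_z) \mid x + \delta_x, z + \delta_z) \\
&\quad = \delta_x \frac{\partial}{\partial x} Q_{Y|XZ}(\tau \mid x, z) \, f_{Y|XZ}(Q_{Y|XZ}(\tau \mid x, z) \mid x + \delta_x, z + \delta_z) + o(\delta_x),
\end{aligned}
\tag{42}
\]
where the last equality is due to the differentiability of $Q_{Y|XZ}$ and $F_{Y|XZ}$ with respect to $x$ and $y$, respectively. Second, similar lines of calculation yield
\[
\begin{aligned}
&\Pr[g(x + \delta_x, E) \le Q_{Y|XZ}(\tau \mid x, z + \delta_z) \mid X = x + \delta_x, Z = z + \delta_z] \\
&\qquad - \Pr[g(x + \delta_x, E) \le Q_{Y|XZ}(\tau \mid x, z) \mid X = x + \delta_x, Z = z + \delta_z] \\
&\quad = F_{Y|XZ}(Q_{Y|XZ}(\tau \mid x, z + \delta_z) \mid x + \delta_x, z + \delta_z) - F_{Y|XZ}(Q_{Y|XZ}(\tau \mid x, z) \mid x + \delta_x, z + \delta_z) \\
&\quad = \delta_z \frac{\partial}{\partial z} Q_{Y|XZ}(\tau \mid x, z) \, f_{Y|XZ}(Q_{Y|XZ}(\tau \mid x, z) \mid x + \delta_x, z + \delta_z) + o(\delta_z),
\end{aligned}
\tag{43}
\]
under the differentiability of $Q_{Y|XZ}$ with respect to $z$.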
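Third,
\[
\begin{aligned}
&\Pr[g(x + \delta_x, E) \le Q_{Y|XZ}(\tau \mid x, z) \mid X = x + \delta_x, Z = z + \delta_z] - \Pr[g(x + \delta_x, E) \le Q_{Y|XZ}(\tau \mid x, z) \mid X = x, Z = z] \\
&\quad = \Pr[g(x + \delta_x, E) \le Q_{Y|XZ}(\tau \mid x, z) \mid Z = z + \delta_z, U = \upsilon(z + \delta_z, x + \delta_x)] \\
&\qquad - \Pr[g(x + \delta_x, E) \le Q_{Y|XZ}(\tau \mid x, z) \mid Z = z, U = \upsilon(z, x)] \\
&\quad = \Pr[g(x + \delta_x, E) \le Q_{Y|XZ}(\tau \mid x, z) \mid Z = z + \delta_z, U = \upsilon(z, x)] \\
&\qquad - \Pr[g(x + \delta_x, E) \le Q_{Y|XZ}(\tau \mid x, z) \mid Z = z, U = \upsilon(z, x)] = 0,
\end{aligned}
\tag{44}
\]
where the first equality is due to Assumption 9, the second equality is due to the definition of $(\delta_x, \delta_z)$, and the last equality is due to the instrument independence in the model (34). Fourth, with the short-hand notation $B := \beta(X,E)$, we have
\[
\begin{aligned}
&\Pr[g(x + \delta_x, E) \le Q_{Y|XZ}(\tau \mid x, z) \mid X = x, Z = z] - \Pr[g(x, E) \le Q_{Y|XZ}(\tau \mid x, z) \mid X = x, Z = z] \\
&\quad = \Pr[Q_{Y|XZ}(\tau \mid x, z) < Y \le Q_{Y|XZ}(\tau \mid x, z) - (g(x + \delta_x, E) - Y) \mid X = x, Z = z] \\
&\qquad - \Pr[Q_{Y|XZ}(\tau \mid x, z) - (g(x + \delta_x, E) - Y) < Y \le Q_{Y|XZ}(\tau \mid x, z) \mid X = x, Z = z] \\
&\quad = \Pr[Q_{Y|XZ}(\tau \mid x, z) \le Y \le Q_{Y|XZ}(\tau \mid x, z) - \delta_x B \mid X = x, Z = z] \\
&\qquad - \Pr[Q_{Y|XZ}(\tau \mid x, z) - \delta_x B \le Y \le Q_{Y|XZ}(\tau \mid x, z) \mid X = x, Z = z] + o(\delta_x) \\
&\quad = \int_{Q_{Y|XZ}(\tau \mid x, z)}^{\infty} \int_{-\infty}^{-\delta_x^{-1}[y - Q_{Y|XZ}(\tau \mid x, z)]} f_{YB|XZ}(y, b \mid x, z) \, db \, dy \\
&\qquad - \int_{-\infty}^{Q_{Y|XZ}(\tau \mid x, z)} \int_{-\delta_x^{-1}[y - Q_{Y|XZ}(\tau \mid x, z)]}^{\infty} f_{YB|XZ}(y, b \mid x, z) \, db \, dy + o(\delta_x) \\
&\quad = -\delta_x \int_{-\infty}^{0} b \, f_{YB|XZ}(Q_{Y|XZ}(\tau \mid x, z), b \mid x, z) \, db - \delta_x \int_{0}^{\infty} b \, f_{YB|XZ}(Q_{Y|XZ}(\tau \mid x, z), b \mid x, z) \, db + o(\delta_x) \\
&\quad = -\delta_x \, E[B \mid Y = Q_{Y|XZ}(\tau \mid x, z), X = x, Z = z] \, f_{Y|XZ}(Q_{Y|XZ}(\tau \mid x, z) \mid x, z) + o(\delta_x),
\end{aligned}
\tag{45}
\]
where the second equality is due to the differentiability of $g$ and $F_{Y|XZ}$ with respect to $x$ and $y$, respectively, and the fourth equality is due to a change of variables and integration by parts. Adding the four equations (42), (43), (44) and (45) together, the left-hand sides telescope to zero and we get
\[
\begin{aligned}
0 &= \delta_x \frac{\partial}{\partial x} Q_{Y|XZ}(\tau \mid x, z) \, f_{Y|XZ}(Q_{Y|XZ}(\tau \mid x, z) \mid x + \delta_x, z + \delta_z) \\
&\quad + \delta_z \frac{\partial}{\partial z} Q_{Y|XZ}(\tau \mid x, z) \, f_{Y|XZ}(Q_{Y|XZ}(\tau \mid x, z) \mid x + \delta_x, z + \delta_z) \\
&\quad - \delta_x \, E[B \mid Y = Q_{Y|XZ}(\tau \mid x, z), X = x, Z = z] \, f_{Y|XZ}(Q_{Y|XZ}(\tau \mid x, z) \mid x, z) \\
&\quad + o(\delta_x) + o(\delta_z).
\end{aligned}
\]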
The desired result follows from this equation together with Equation (41), Assumption 9, and the differentiability of $f_{Y|XZ}$ with respect to the conditioning variables $(x, z)$. □

7.4. Proof of Proposition 3. We prove the following four auxiliary lemmas to prove Proposition 3.

Lemma 14. Suppose that $F_{X|Z}(\cdot \mid z)$ is strictly increasing in the model (34). Then, Assumption 11 holds if and only if $F_{X|Z}(h(z,u) \mid z) = F_{X|Z}(h(z',u) \mid z')$ holds for all $z$, $z'$ and $u$.

Proof. ($\Rightarrow$) Suppose that $h(z,u) = \bar{h}(z, \iota(u))$ holds for all $(z,u)$, where $\bar{h}$ is strictly increasing in its second argument. Then,
\[
\begin{aligned}
F_{X|Z}(h(z,u) \mid z)
&= F_{X|Z}(\bar{h}(z, \iota(u)) \mid z)
= \Pr(\bar{h}(z, \iota(U)) \le \bar{h}(z, \iota(u)) \mid Z = z) \\
&\overset{(1)}{=} \Pr(\iota(U) \le \iota(u) \mid Z = z)
\overset{(2)}{=} \Pr(\iota(U) \le \iota(u) \mid Z = z') \\
&\overset{(1)}{=} \Pr(\bar{h}(z', \iota(U)) \le \bar{h}(z', \iota(u)) \mid Z = z')
= F_{X|Z}(\bar{h}(z', \iota(u)) \mid z')
= F_{X|Z}(h(z',u) \mid z'),
\end{aligned}
\]
where equalities (1) are due to the premise that $\bar{h}$ is strictly increasing in its second argument, and equality (2) is due to the instrument independence in (34).

($\Leftarrow$) Let $z^0$ be an arbitrarily chosen element of the support of $Z$. Define $\iota : \mathbb{R}^M \to \mathbb{R}$ by $\iota(u) = F_{X|Z}(h(z^0, u) \mid z^0)$. For each $\eta \in (0,1)$, there exists $\xi(\eta)$ on the support of $X \mid (Z = z^0)$ such that $\eta = F_{X|Z}(\xi(\eta) \mid z^0)$. But then, for this $\xi(\eta)$ as an element of the support of $X \mid (Z = z^0)$, there exists $\upsilon(\eta)$ on the support of $U$ (not necessarily unique) such that $\xi(\eta) = h(z^0, \upsilon(\eta))$, i.e., $\eta = F_{X|Z}(h(z^0, \upsilon(\eta)) \mid z^0)$. As such an element $\upsilon(\eta)$ is not generally unique for a given $\eta$, the map $\bar{h} : (z, \eta) \mapsto h(z, \upsilon(\eta))$ need not be well-defined. However, we will show that such a map $\bar{h}$ is indeed well-defined if $F_{X|Z}(h(z,u) \mid z) = F_{X|Z}(h(z',u) \mid z')$ holds for all $z$, $z'$ and $u$. To show that it is well-defined, consider $\upsilon(\eta)$ and $\tilde{\upsilon}(\eta)$ such that $\xi(\eta) = h(z^0, \upsilon(\eta)) = h(z^0, \tilde{\upsilon}(\eta))$.
Note that
\[
F_{X|Z}(h(z, \upsilon(\eta)) \mid z) \overset{(*)}{=} F_{X|Z}(h(z^0, \upsilon(\eta)) \mid z^0) = F_{X|Z}(\xi(\eta) \mid z^0) = F_{X|Z}(h(z^0, \tilde{\upsilon}(\eta)) \mid z^0) \overset{(*)}{=} F_{X|Z}(h(z, \tilde{\upsilon}(\eta)) \mid z)
\]
holds, where equalities $(*)$ are due to $F_{X|Z}(h(z,u) \mid z) = F_{X|Z}(h(z',u) \mid z')$. This implies that $h(z, \upsilon(\eta)) = h(z, \tilde{\upsilon}(\eta))$ by the assumption that $F_{X|Z}(\cdot \mid z)$ is strictly increasing. Therefore, the mapping $\bar{h} : \mathbb{R}^2 \to \mathbb{R}$ defined by $(z, \eta) \mapsto h(z, \upsilon(\eta))$ is indeed well-defined.

Having proven that $\bar{h}$ is well-defined, we next show that $h(z,u) = \bar{h}(z, \iota(u))$ holds for all $(z,u)$. Observe
\[
F_{X|Z}(\bar{h}(z, \iota(u)) \mid z) \overset{(1)}{=} F_{X|Z}(h(z, \upsilon(\iota(u))) \mid z) \overset{(2)}{=} F_{X|Z}(h(z^0, \upsilon(\iota(u))) \mid z^0) \overset{(3)}{=} \iota(u) \overset{(4)}{=} F_{X|Z}(h(z^0, u) \mid z^0) \overset{(2)}{=} F_{X|Z}(h(z, u) \mid z),
\]
where equality (1) is due to the definition of the well-defined map $\bar{h}$, equalities (2) are due to $F_{X|Z}(h(z,u) \mid z) = F_{X|Z}(h(z',u) \mid z')$, equality (3) is due to the definition of $\upsilon$, and equality (4) is due to the definition of $\iota$. This implies $\bar{h}(z, \iota(u)) = h(z,u)$ for each $(z,u)$ by the assumption that $F_{X|Z}(\cdot \mid z)$ is strictly increasing.

It remains to show that $\bar{h}$ constructed in this way is strictly monotone in its second argument. To see this, let $0 < \eta_1 < \eta_2 < 1$. Then,
\[
F_{X|Z}(\bar{h}(z, \eta_1) \mid z) \overset{(1)}{=} F_{X|Z}(h(z, \upsilon(\eta_1)) \mid z) \overset{(2)}{=} F_{X|Z}(h(z^0, \upsilon(\eta_1)) \mid z^0) \overset{(3)}{=} \eta_1 < \eta_2 \overset{(3)}{=} F_{X|Z}(h(z^0, \upsilon(\eta_2)) \mid z^0) \overset{(2)}{=} F_{X|Z}(h(z, \upsilon(\eta_2)) \mid z) \overset{(1)}{=} F_{X|Z}(\bar{h}(z, \eta_2) \mid z),
\]
where equalities (1) are due to the definition of the well-defined map $\bar{h}$, equalities (2) are due to $F_{X|Z}(h(z,u) \mid z) = F_{X|Z}(h(z',u) \mid z')$, and equalities (3) are due to the definition of $\upsilon$. It follows from this inequality that $\bar{h}(z, \eta_1) < \bar{h}(z, \eta_2)$ by the assumption that $F_{X|Z}(\cdot \mid z)$ is strictly increasing. Therefore $\bar{h}$ is strictly increasing in its second argument. □

Lemma 15.
Suppose that $F_{X|Z}(\cdot \mid z)$ is strictly increasing.[8] If Assumption 11 holds and $\frac{\partial}{\partial z} Q_{X|Z}(v \mid z) \neq 0$, then for the choice of $c := \left[ \frac{\partial}{\partial z} Q_{X|Z}(v \mid z) \right]^{-1}$,
\[
\frac{\partial}{\partial x} \log f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z) + c \, \frac{\partial}{\partial z} \log f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z) = 0
\]
holds for all $u$ on the support of $F_{U|XZ}(\cdot \mid Q_{X|Z}(v \mid z), z)$.

[8] In addition, we also assume the following regularity conditions: $F_{X|Z}$ is absolutely continuous; $Q_{X|Z}$ is continuously differentiable with respect to $z$; and $f_{U|XZ}$ is continuously differentiable with respect to $(x, z)$.

Proof. Let $V := F_{X|Z}(h(Z,U) \mid Z)$. Then, Assumption 11 and Lemma 14 imply that $V$ does not depend on $Z$, thus $V = \nu(U)$ for some function $\nu$. Since $(U, V) = (U, \nu(U))$, we have $(U, V) \perp\!\!\!\perp Z$ by the instrument independence in the model (34). Using this independence restriction yields
\[
F_{UZ|V} = \frac{F_{UV|Z}}{F_V} F_Z = \frac{F_{UV}}{F_V} F_Z = F_{U|V} F_Z = F_{U|V} F_{Z|V},
\]
showing that $U \perp\!\!\!\perp Z \mid V$.

Now, note that the map $(v, z) \mapsto (Q_{X|Z}(v \mid z), z)$ is well-defined and injective, owing to the well-definition and injectivity of the map $v \mapsto Q_{X|Z}(v \mid z)$ by the absolute continuity of $F_{X|Z}$ (note that the absolute continuity of $F_{X|Z}$ in particular implies that there is no singular part in its Lebesgue-Radon-Nikodym decomposition, thus the quantile is strictly increasing in $v$). Therefore, we have $f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z) = f_{U|VZ}(u \mid v, z)$ for all $z$ and $v$ in their respective domains. Finally, use the independence condition $U \perp\!\!\!\perp Z \mid V$ obtained in the last paragraph to conclude that $f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z) = f_{U|V}(u \mid v)$, which is constant in $z$.

Since $f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z)$ is constant in $z$, we have
\[
\begin{aligned}
0 &= \frac{d}{dz} f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z) \\
&= \frac{\partial}{\partial z} Q_{X|Z}(v \mid z) \cdot \frac{\partial}{\partial x} f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z) + \frac{\partial}{\partial z} f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z)
\end{aligned}
\]
under the differentiability of $Q_{X|Z}$ with respect to $z$ and the differentiability of $f_{U|XZ}$ with respect to $(x, z)$. Divide this equation by $\left[ \frac{\partial}{\partial z} Q_{X|Z}(v \mid z) \right] \cdot f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z)$ to prove the lemma. □

Lemma 16.
Suppose that $F_{X|Z}(\cdot \mid z)$ is strictly increasing.[9] If Assumption 11 does not hold, then there exists a set $\bar{\mathcal{U}} \subset \mathcal{U}$ of positive measure such that
\[
\frac{d}{dz} f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z) \neq 0 \tag{46}
\]
holds for some $z \in \mathcal{Z}$ and for all $u \in \bar{\mathcal{U}}$.

[9] In addition, we also assume the following regularity conditions: $h$ is continuously differentiable; $F_{X|Z}$ is continuously differentiable; $Q_{X|Z}$ is continuously differentiable with respect to $z$; and $f_{U|XZ}$ is continuously differentiable with respect to $(x, z)$.

Proof. Let $V := F_{X|Z}(h(Z,U) \mid Z)$. Write $H(z,u) := F_{X|Z}(h(z,u) \mid z)$. The differentiability of $h$ and $F_{X|Z}$ implies $H \in C^1(\mathbb{R}^{M+1}, \mathbb{R})$, where $M$ is the dimension of $U$. As Assumption 11 does not hold, we have $\nabla_z H(\bar{z}, \bar{u}) \neq 0$ for some $(\bar{z}, \bar{u}) \in \mathcal{Z} \times \mathcal{U}$. Let $j$ be a coordinate of $U$ in $h$ satisfying $\partial h(z,u) / \partial u_j \neq 0$ at $(\bar{z}, \bar{u})$. Then, we have $\nabla_{u_j} H(\bar{z}, \bar{u}) \neq 0$. Thus we have sufficient conditions to invoke the Implicit Function Theorem to obtain a continuous function $\lambda : \mathcal{Z} \supset B_\delta(\bar{z}) \to \mathcal{U}$ defined in a neighborhood of $\bar{z}$ such that $H(z, \lambda(z)) = H(\bar{z}, \bar{u}) =: \bar{v}$. It follows that a continuum of the level set of $V = \bar{v}$ exists in a neighborhood of $z = \bar{z}$. But this level set does not contain arbitrarily close horizontal or vertical displacements $(z \pm \delta, u)$ and $(z, u \pm \delta e_j)$, due to $\nabla_z H(\bar{z}, \bar{u}) \neq 0$ and $\nabla_{u_j} H(\bar{z}, \bar{u}) \neq 0$. These two facts (i.e., existence of a continuum of the level set and no containment of arbitrarily close horizontal or vertical displacements) imply that $\mathrm{supp}[(Z, U_j) \mid V = \bar{v}] \neq \mathrm{supp}[Z \mid V = \bar{v}] \times \mathrm{supp}[U_j \mid V = \bar{v}]$, i.e., the support of $(Z, U_j) \mid (V = \bar{v})$ is not rectangular. Since a rectangular support of the joint distribution is a necessary condition for independence, this implies that $Z \perp\!\!\!\perp U_j \mid (V = \bar{v})$ does not hold, which in turn implies that $Z \perp\!\!\!\perp U \mid (V = \bar{v})$ does not hold.

Keeping the last result in mind, we now want to prove that (46) holds for some $z \in \mathcal{Z}$ and for all $u \in \bar{\mathcal{U}}$ with $\bar{\mathcal{U}}$ a set of positive measure.
But by the assumptions of continuous differentiability of $f_{U|XZ}$ and $Q_{X|Z}$, it suffices to show that (46) holds for some $(z, u) \in \mathcal{Z} \times \mathcal{U}$, since the continuity of the derivatives then yields the corresponding result throughout a neighborhood of such $u$. Suppose, by way of contradiction, that $\frac{d}{dz} f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z) = 0$ holds for all $(z, u) \in \mathcal{Z} \times \mathcal{U}$. As in the proof of Lemma 15, $f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z) = f_{U|VZ}(u \mid v, z)$. Hence, $\frac{d}{dz} f_{U|VZ}(u \mid v, z) = 0$ holds for all $(z, u) \in \mathcal{Z} \times \mathcal{U}$, showing that $Z \perp\!\!\!\perp U \mid V$. This contradicts the conclusion of the previous paragraph. □

Lemma 17. (i) Suppose that the set of structural models satisfies (34).[10] Then,
\[
E[\beta(X,E) \mid X = x, Z = z] = \frac{\partial}{\partial x} E[Y \mid X = x, Z = z] + c \, \frac{\partial}{\partial z} E[Y \mid X = x, Z = z] - B(c, x, z)
\]
holds for any $c \in \mathbb{R}$, where the bias term is
\[
B(c, x, z) = E\left[ Y \left\{ \frac{\partial}{\partial x} \log f_{U|XZ}(U \mid x, z) + c \, \frac{\partial}{\partial z} \log f_{U|XZ}(U \mid x, z) \right\} \,\Big|\, X = x, Z = z \right].
\]
(ii) If in addition $v = F_{X|Z}(x \mid z)$ and $\frac{\partial}{\partial z} Q_{X|Z}(v \mid z) \neq 0$, then Assumption 11 is sufficient to make $B\left( \left[ \frac{\partial}{\partial z} Q_{X|Z}(v \mid z) \right]^{-1}, Q_{X|Z}(v \mid z), z \right) = 0$ for all structural models $(g, F_{E|U}) \in \mathcal{G} \times \mathcal{F}$.
(iii) Assumption 11 is also necessary to make $B\left( \left[ \frac{\partial}{\partial z} Q_{X|Z}(v \mid z) \right]^{-1}, Q_{X|Z}(v \mid z), z \right) = 0$ for all structural models $(g, F_{E|U}) \in \mathcal{G} \times \mathcal{F}$.

[10] In addition, we also assume the following regularity assumptions: $g$ is continuously differentiable with respect to $x$; and $f_{U|XZ}$ is continuously differentiable with respect to $(x, z)$.

Proof. We write the observable mean regression $E[Y \mid X = x, Z = z]$ as
\[
\begin{aligned}
E[Y \mid X = x, Z = z]
&= \int \int g(x, \varepsilon) f_{E|UXZ}(\varepsilon \mid u, x, z) \, d\varepsilon \, f_{U|XZ}(u \mid x, z) \, du \\
&= \int \int g(x, \varepsilon) f_{E|U}(\varepsilon \mid u) \, d\varepsilon \, f_{U|XZ}(u \mid x, z) \, du,
\end{aligned}
\]
where the last equality is due to the instrumental independence in (34).
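Take derivatives to obtain
\[
\begin{aligned}
\frac{\partial}{\partial x} E[Y \mid X = x, Z = z]
&= \int \int \beta(x, \varepsilon) f_{E|U}(\varepsilon \mid u) \, d\varepsilon \, f_{U|XZ}(u \mid x, z) \, du + \int \int g(x, \varepsilon) f_{E|U}(\varepsilon \mid u) \, d\varepsilon \, \frac{\partial}{\partial x} f_{U|XZ}(u \mid x, z) \, du \\
&= E[\beta(x, E) \mid X = x, Z = z] + E\left[ Y \frac{\partial}{\partial x} \log f_{U|XZ}(U \mid x, z) \,\Big|\, X = x, Z = z \right],
\end{aligned}
\]
where the first equality is due to the differentiability of $g$ and $f_{U|XZ}$ with respect to $x$ as well as the $L^1$ dominance of the integrand. By similar calculations, we have
\[
c \, \frac{\partial}{\partial z} E[Y \mid X = x, Z = z] = E\left[ Y c \, \frac{\partial}{\partial z} \log f_{U|XZ}(U \mid x, z) \,\Big|\, X = x, Z = z \right].
\]
Combining these two equalities yields
\[
E[\beta(X,E) \mid X = x, Z = z] = \frac{\partial}{\partial x} E[Y \mid X = x, Z = z] + c \, \frac{\partial}{\partial z} E[Y \mid X = x, Z = z] - B(c, x, z),
\]
where the bias term is
\[
B(c, x, z) = E\left[ Y \left\{ c \, \frac{\partial}{\partial z} \log f_{U|XZ}(U \mid x, z) + \frac{\partial}{\partial x} \log f_{U|XZ}(U \mid x, z) \right\} \,\Big|\, X = x, Z = z \right].
\]
This proves part (i) of the lemma. Apply Lemma 15 to prove part (ii).

Lastly, we prove part (iii) of the lemma by applying Lemma 16. We prove the contrapositive statement, that if Assumption 11 does not hold then there exists a structural model $(g, F_{E|U}) \in \mathcal{G} \times \mathcal{F}$ such that $B\left( \left[ \frac{\partial}{\partial z} Q_{X|Z}(v \mid z) \right]^{-1}, Q_{X|Z}(v \mid z), z \right) \neq 0$. By Lemma 16, there exists a set $\bar{\mathcal{U}} \subset \mathcal{U}$ of positive measure such that
\[
\frac{d}{dz} f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z) \neq 0
\]
holds for some $z \in \mathcal{Z}$ and for all $u \in \bar{\mathcal{U}}$. There exists a subset $\tilde{\mathcal{U}}$ of $\bar{\mathcal{U}}$ with positive measure on which the sign is positive or negative throughout. Without loss of generality, assume that the above expression is positive on $\tilde{\mathcal{U}}$. Pick a structure $(g, F_{E|U}) \in \mathcal{G} \times \mathcal{F}$ such that $\int g(x, \varepsilon) f_{E|U}(\varepsilon \mid u) \, d\varepsilon$ is positive on $\tilde{\mathcal{U}}$ and zero outside $\tilde{\mathcal{U}}$ for some $x$.[11] With this choice of $(g, F_{E|U})$, we have
\[
\begin{aligned}
&\frac{\partial}{\partial z} Q_{X|Z}(v \mid z) \, B\left( \left[ \frac{\partial}{\partial z} Q_{X|Z}(v \mid z) \right]^{-1}, Q_{X|Z}(v \mid z), z \right) \\
&\quad = E\left[ Y \left\{ \frac{\partial}{\partial z} \log f_{U|XZ}(U \mid x, z) + \frac{\partial}{\partial z} Q_{X|Z}(v \mid z) \cdot \frac{\partial}{\partial x} \log f_{U|XZ}(U \mid x, z) \right\} \,\Big|\, X = x, Z = z \right] \\
&\quad = \int \left[ \int g(x, \varepsilon) f_{E|U}(\varepsilon \mid u) \, d\varepsilon \right] \left[ \frac{d}{dz} f_{U|XZ}(u \mid Q_{X|Z}(v \mid z), z) \right] du > 0.
\end{aligned}
\]
This shows that $B\left( \left[ \frac{\partial}{\partial z} Q_{X|Z}(v \mid z) \right]^{-1}, Q_{X|Z}(v \mid z), z \right) \neq 0$. □

Proposition 3 concerning the identifying equality of Corollary 4 (i) follows from this lemma.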
The conclusion concerning the identifying equality of Corollary 4 (ii) follows from similar lines of argument. $\square$

[11] There exist many such structures $(g, F_{E|U})$. As one example of a way to construct such a structure, consider an orthonormal basis $\mathcal{B}$ of the $L^2(m)$ space. Form a function $\phi$ and an indexed family $\{f_{E|U}(\,\cdot \mid u)\}_{u \in \mathcal{U}}$ of density functions by linear combinations of $\mathcal{B}$ so that $\langle \phi, f_{E|U}(\,\cdot \mid u) \rangle > 0$ for all $u \in \tilde{\mathcal{U}}$ and $\langle \phi, f_{E|U}(\,\cdot \mid u) \rangle = 0$ for all $u \in \mathcal{U} \setminus \tilde{\mathcal{U}}$ by applying the Parseval-Bessel equality. Then, the required property is satisfied by any pair $(g, F_{E|U})$ with any function $g$ such that $g(x, \cdot\,) = \phi$ for some $x$. Note that $g$ is required to be continuously differentiable only with respect to $x$, and this requirement is not violated by this construction of $g$.

CHAPTER 3
Nonparametric Model Tests with Discrete Instruments

1. Introduction

The following is a list of four common micro-econometric models:

(i) $Y_i = \beta S_i + \varepsilon(A_i)$ – Constant coefficient model
(ii) $Y_i = \phi(S_i) + \varepsilon(A_i)$ – Nonlinear separable model
(iii) $Y_i = \beta(A_i) S_i + \varepsilon(A_i)$ – Random coefficient model
(iv) $Y_i = \phi(S_i, A_i)$ – Nonlinear nonseparable model

where $Y_i$, $S_i$, and $A_i$ denote observed outcome, observed endogenous choice, and unobserved heterogeneity, respectively. Finding out the correct model reveals the underlying economic structure. For example, if we find that model (i) or (ii) is correct, then we can deduce that variations in first-stage choice come from preferences or costs, rather than from heterogeneity in marginal returns. Moreover, finding out the correct model allows us to choose the correct specification on which to conduct statistical inferences.[1] Therefore, there are both economic and econometric reasons why one may be interested in distinguishing these four types of models. This paper proposes a practically feasible method of testing to this end.
In the ideal setting where a continuous instrument induces smooth first-stage effects to construct a continuous control variable, the existing methods of nonparametric inference would achieve this objective. However, this ideal setting is not often the case, as one can see in the surveys by Angrist and Krueger (2001) and Card (2001). In most empirical data covered in these surveys, instruments exhibit only coarse and discrete variations. An important example of discrete instruments is the quarter of birth used by Angrist and Krueger (1991). With discrete instruments, structural features are only partially identified under the nonlinear nonseparable model (iv) (Chesher, 2005).[2] In this light, I show how partially identified parameters can be used to distinguish the model types (i), (ii), and (iii) against (iv).

[1] Various identification and estimation theories have been proposed. Model (ii) is studied by Blundell and Powell (2003), Florens (2003), Newey and Powell (2003), Hall and Horowitz (2005), Darolles et al. (2011), and Horowitz (2011). Model (iii) is studied by Garen (1984) and Heckman and Vytlacil (1998), and the associated IV quantile regression has been studied by Ma and Koenker (2006), Blundell and Powell (2007), Lee (2007), and Jun (2009). Model (iv) is studied by Chesher (2003, 2005), Altonji and Matzkin (2005), Chernozhukov and Hansen (2005), Chernozhukov et al. (2007), Horowitz and Lee (2007), Hoderlein and Mammen (2009), Imbens and Newey (2009), and Torgovitsky (2011).

Another practical limitation in common empirical data is the locality of instrumental effects, which prohibits identification of the global shape of structural functions. Again, an example is Angrist and Krueger (1991), in which the quarter-of-birth instrument affects schooling choices only near the 9th to the 10th grades (for most cohorts and states).
While this locality often reduces the power of tests, we will empirically show that models (i) and (ii) are rejected for wage outcome as a function of years of education. Likewise, it will be empirically shown that models (i) and (iii) are rejected for infant birth weight as a function of smoking intensity.

The objective of this paper is related to the literature on heterogeneity testing. For example, Chernozhukov and Hansen (2006) propose a test of heterogeneous quantile regression parameters under endogeneity. This idea is related to distinguishing model (i) from model (iii). The objective of this paper is also related to the literature on nonparametric testing. For example, Horowitz and Lee (2009) propose tests of parametric quantile regressions against a nonparametric family of quantile regressions under endogeneity. This idea can be used to distinguish model (iii) from model (iv).

The value added by this paper to this existing literature is twofold. First, we develop a single device that can be used to distinguish the four types of models, (i), (ii), (iii) and (iv). Second, more importantly, our method accounts for the aforementioned difficulty associated with discrete instruments. This practically important issue has not been explicitly addressed in the existing methods of heterogeneity or specification testing under endogeneity.

[2] Also related is Jun, Pinkse, and Xu (2011).

The paper is organized as follows. We first discuss how to use partially identified parameters to distinguish the four models in Section 2. Estimates of these partially identified parameters are used to construct test statistics in Section 3. The tests are applied to two empirical problems in Section 4. Section 5 summarizes the paper. The appendix contains mathematical notes.

2. Bounds as Means of Specification Testing

How can we distinguish the four model types? Let $\frac{\partial \phi}{\partial s}$ denote the partial effect of the structural function $\phi$ with respect to $s$.
Under each model from (i) to (iii), this partial effect reduces to

$\frac{\partial \phi}{\partial s}\big|_{(s,a)} = \beta$ under (i) the constant coefficient model,
$\frac{\partial \phi}{\partial s}\big|_{(s,a)} = \phi'(s)$ under (ii) the nonlinear separable model, and
$\frac{\partial \phi}{\partial s}\big|_{(s,a)} = \beta(a)$ under (iii) the linear random coefficient model.

They are a constant in (i), a function of only $s$ in (ii), and a function of only $a$ in (iii). Models (i) and (iii) entail $s$-invariance of the partial effects, whereas models (i) and (ii) entail $a$-invariance of the partial effects.

The next step is to translate these discriminatory characteristics into empirically testable restrictions. As noted in the introductory section, empirical data often exhibit only local instrumental effects near a certain point of $s$. Under this common limitation, the $s$-invariance and the $a$-invariance of the partial effects at a given $s$ are characterized by

$$H_0^S: \frac{\partial^2}{\partial s^2} \phi(s, a) = 0 \text{ for all } a \in \mathrm{supp}(A) \text{ at a given } s, \text{ and}$$

$$H_0^A: \frac{\partial^2}{\partial s\, \partial a} \phi(s, a) = 0 \text{ for all } a \in \mathrm{supp}(A) \text{ at a given } s,$$

respectively, where $\mathrm{supp}(\cdot)$ denotes the support of the random variable. Rejection of $H_0^S$ will result in falsification of model types (i) and (iii). Likewise, rejection of $H_0^A$ will result in falsification of model types (i) and (ii).

Specification tests therefore require some knowledge of the partial effects, which are not observable to us. We use as an alternative device the average partial effect, denoted by

$$APE(s, z) := E\left[ \frac{\partial}{\partial s} \phi(S, A) \,\middle|\, S = s, Z = z \right].$$

This $APE(s, z)$ contains aggregate information about $\frac{\partial \phi}{\partial s}$ over some set of population near the locality of $s$.[3] The instrument $Z$ as an additional conditioning variable plays a key role in recovering the control variable (cf. Imbens and Newey (2009)), which in turn reveals variations of $\frac{\partial \phi}{\partial s}$ in $a$. Consequently, sensitivity of $APE(s, z)$ to $z$ indicates whether $H_0^A$ is true or not.

Point identification of the APE would allow us to test the hypotheses $H_0^S$ and $H_0^A$ easily.
However, as noted in the introductory section, discrete instruments are obstacles for recovering smooth counterfactuals. Consequently, the APE is generally at best partially identified, and our testing criteria will be based on the bounds of the APE, and their intersections. The following subsection discusses relationships between $APE(s, z)$ and its bounds, which will establish empirically testable restrictions under the null hypotheses $H_0^S$ and $H_0^A$.

[3] Even if $\phi$ is differentiable everywhere, this object need not exist in general due to the Borel-Kolmogorov paradox. But we assume it does exist, a sufficient condition for which is provided in Appendix Section 6.

2.1. Bounds of the APE and Their Implications for the Hypotheses. In the literature on nonseparable models, the first-stage restrictions are often used for identification of the second-stage features. Consider the following nonseparable first-stage:

$$S = \psi(Z, V),$$

where $Z$ is an instrumental variable which may be discrete, and $V$ denotes the reduced-form unobserved heterogeneity. We impose the following standard restrictions:

Assumption 12 (Restrictions).
(IM) First-Stage Monotonicity: $\psi(z, v) \le \psi(z, v') \iff \psi(z', v) \le \psi(z', v')$ for all $z, z' \in \mathrm{supp}(Z)$ and for all $v, v' \in \mathrm{supp}(V)$.
(IV) Instrument Independence: $(A, V) \perp\!\!\!\perp Z$.
(AC) Absolute Continuity: the distribution of $S \mid Z = z$ is absolutely continuous with a convex support for every $z \in \mathrm{supp}(Z)$.
(SI) Strong Instrument: $\psi(z, v)$ is strictly increasing in $z$.

(IM) states that the first-stage choice is monotone in some index of unobserved attributes, an assumption useful for constructing control variables (e.g., Imbens and Newey (2009)). (IV) states the standard instrument independence. (AC) requires that $S$ is continuously distributed, which is an admissible abstraction of real data if $S$ has many support points such as years of schooling or number of cigarettes.
On the other hand, we maintain the assumption that $Z$ is discrete since instruments often have far fewer support points. (AC) guarantees that the distribution of $S \mid Z = z$ has a probability density function denoted by $f_{S|Z}(\,\cdot \mid z)$, and that $F_{S|Z}(\,\cdot \mid z)$ is invertible on its support. Under this restriction, we denote the quantile regression by $Q_{S|Z} := F_{S|Z}^{-1}$. (SI) requires nontrivial first-stage effects. Alternatively, $\psi(z, v)$ can be strictly decreasing in $z$, in which case we only need to relabel $z$ into negative values. In practice, we are generally restricted to the locality of $s$ at which (SI) is satisfied, e.g., $s \approx 9$-$10$ for Angrist and Krueger (1991).

As a device of characterizing model types, we use the notion of the local shapes of the structural function $\phi$. To formalize the sense of locality, consider the following finite sets

$$\mathcal{Z}^+_{s,z} := \{z' \in \mathrm{supp}(Z) \mid z' > z,\ s \in \mathrm{supp}(S \mid Z = z')\} \quad \text{and} \quad \mathcal{Z}^-_{s,z} := \{z' \in \mathrm{supp}(Z) \mid z' < z,\ s \in \mathrm{supp}(S \mid Z = z')\}$$

for a given $(s, z) \in \mathrm{supp}(S, Z)$. $\mathcal{Z}^+_{s,z}$ consists of the "right instruments," whereas $\mathcal{Z}^-_{s,z}$ consists of the "left instruments." The right (respectively, left) instruments counterfactually induce higher (respectively, lower) treatment values under (SI). Define $z^+_{s,z} = \min \mathcal{Z}^+_{s,z}$ and $z^-_{s,z} = \max \mathcal{Z}^-_{s,z}$, the right and left instruments closest to $z$, respectively. Define the smallest local interval

$$I(s, z) := \Big( \min\{Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^+_{s,z}),\ Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^-_{s,z})\},\ \max\{Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^+_{s,z}),\ Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^-_{s,z})\} \Big)$$

bounded by the closest counterfactual levels of treatments induced by the instruments.

Definition 1 (Local Shape). Suppose that $\phi \in C^2$. The following three local shapes are defined:
(DR) Local Concavity: $\frac{\partial^2 \phi}{\partial s^2}\big|_{(s',a)} < 0$ for all $s' \in I(s, z)$ and $a \in \mathrm{supp}(A)$.
(CR) Local Linearity: $\frac{\partial^2 \phi}{\partial s^2}\big|_{(s',a)} = 0$ for all $s' \in I(s, z)$ and $a \in \mathrm{supp}(A)$.
(IR) Local Convexity: $\frac{\partial^2 \phi}{\partial s^2}\big|_{(s',a)} > 0$ for all $s' \in I(s, z)$ and $a \in \mathrm{supp}(A)$.

Local concavity, linearity, and convexity determine the order relation between the APE and its upper and lower bounds. For intuition, let $B^R$ and $B^L$ denote the difference quotients of $\phi$ that can be formed by counterfactual variations in treatment $S$ induced by the right and left instruments, respectively. Under the local concavity of $\phi$, we will see $B^R < APE < B^L$. The inequality will be reversed under the local convexity of $\phi$. Moreover, we expect to see the equality $B^R = APE = B^L$ under the local linearity of $\phi$. The following theorem formalizes these informal arguments.

Theorem 5 (Bounds of the APE). Suppose that Assumption 12 holds. Then, the (in-)equalities

$B(s; z^+_{s,z}, z) < APE(s, z) < B(s; z^-_{s,z}, z)$ under (DR)
$B(s; z^+_{s,z}, z) = APE(s, z) = B(s; z^-_{s,z}, z)$ under (CR)
$B(s; z^+_{s,z}, z) > APE(s, z) > B(s; z^-_{s,z}, z)$ under (IR)

hold for $z^+_{s,z} = \min \mathcal{Z}^+_{s,z}$ and $z^-_{s,z} = \max \mathcal{Z}^-_{s,z}$, where $B$ is defined with $\theta_z(s) := F_{S|Z}(s \mid z)$ by

$$B(s; z', z) := \frac{E\big[Y \mid S = Q_{S|Z}(\theta_z(s) \mid z'),\ Z = z'\big] - E\big[Y \mid S = Q_{S|Z}(\theta_z(s) \mid z),\ Z = z\big]}{Q_{S|Z}(\theta_z(s) \mid z') - Q_{S|Z}(\theta_z(s) \mid z)}.$$

Remark 12. When $\mathcal{Z}^+_{s,z}$ (respectively $\mathcal{Z}^-_{s,z}$) is an empty set, there is no lower bound (respectively no upper bound) under Definition 1 (DR). The opposite is the case under Definition 1 (IR).

Remark 13. As the support of $Z$ becomes richer to constitute a cluster point at $z$, these bounds asymptotically converge to the true APE under some regularity assumptions.

Remark 14. Notice that the formula of the bound $B(s; z', z)$ resembles that of the LATE (Imbens and Angrist, 1994).
It takes the form of the "heterogeneous reduced-form effects" divided by the "heterogeneous first-stage effects." This parallels the LATE, which takes the form of the "average reduced-form effects" divided by the "average first-stage effects."

Implication for Hypothesis Testing: The theorem suggests that comparing the order relation between $B(s; z', z)$ and $B(s; z'', z)$, which can be identified from observed data, reveals the local shape of the structural function $\phi$. If data indicate $B(s; z', z) < B(s; z'', z)$ or $B(s; z', z) > B(s; z'', z)$, then we are in a position to reject (CR), and hence to reject $H_0^S$. That is, variations of $B$ in its second argument (i.e., $z'$ or $z''$) are used to test the hypothesis $H_0^S$.

On the other hand, variations of $B$ in its third argument (i.e., $z$) can be used to test $H_0^A$. Note that under the null hypothesis $H_0^A$, the partial effect is invariant to variations in $a$. Varying $z$ while fixing $s$ at a given locality causes variations in the control variable $v$, which in turn affects $a$ under endogeneity. Thus, $APE(s, z)$ should be insensitive to variations in $z$ under $H_0^A$. This in turn implies non-emptiness of the intersection bounds $\bigcap_{z \in \mathcal{Z}} [B(s; z^+_{s,z}, z), B(s; z^-_{s,z}, z)]$ under $H_0^A$. Contrapositively, emptiness of these intersection bounds leads to rejection of $H_0^A$.

Example – Model (i): For an instance of the linear constant coefficient model

$$\begin{cases} Y = \beta S + A \\ S = \pi Z + V \end{cases} \quad \text{with} \quad (A, V) \sim N\left( 0, \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \right),$$

some calculation yields $Q_{S|Z}(\theta \mid z) = \pi z + Q_V(\theta)$ and $E[Y \mid S = s, Z = z] = (1 + \beta)s - \pi z$. Substituting these expressions in the bound formula shows

$$B(s; z', z) = \beta \quad \text{for all } (s, z', z).$$

The bounds $B(s; z', z)$ universally point-identify the constant partial derivative, which is $\beta$. As the upper and lower bounds coincide at $\beta$, we fail to reject $H_0^S$. Moreover, by non-emptiness of the intersection bounds $\bigcap_{z \in \mathcal{Z}} [B(s; z^+_{s,z}, z), B(s; z^-_{s,z}, z)] = \{\beta\}$, we fail to reject $H_0^A$ too. $\square$
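The substitution step in Example (i) can be checked numerically. The following is a minimal sketch, not part of the original text: the helper names `bound` and `bound_model_i` and the parameter values `beta=0.7`, `pi=1.5` are hypothetical, and the closed forms for $E[Y \mid S, Z]$ and $Q_{S|Z}$ are those stated in the example with $V \sim N(0, 1)$.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # V ~ N(0, 1), as in Example (i)


def bound(cond_mean, cdf, quantile, s, z_prime, z):
    """Population bound B(s; z', z) of Theorem 5 from closed-form ingredients.

    cond_mean(s, z) = E[Y | S = s, Z = z]
    cdf(s, z)       = F_{S|Z}(s | z)
    quantile(th, z) = Q_{S|Z}(th | z)
    """
    theta = cdf(s, z)                 # rank of s within the Z = z subpopulation
    s_cf = quantile(theta, z_prime)   # counterfactual treatment level under z'
    # Q_{S|Z}(theta | z) equals s by construction, so the denominator is s_cf - s.
    return (cond_mean(s_cf, z_prime) - cond_mean(s, z)) / (s_cf - s)


def bound_model_i(s, z_prime, z, beta=0.7, pi=1.5):
    """Bound for Example (i); beta and pi are illustrative values only."""
    cond_mean = lambda s_, z_: (1.0 + beta) * s_ - pi * z_
    cdf = lambda s_, z_: STD_NORMAL.cdf(s_ - pi * z_)
    quantile = lambda th, z_: pi * z_ + STD_NORMAL.inv_cdf(th)
    return bound(cond_mean, cdf, quantile, s, z_prime, z)


if __name__ == "__main__":
    # Right (z' = 2) and left (z' = 0) instruments relative to z = 1.
    print(bound_model_i(2.0, 2, 1), bound_model_i(2.0, 0, 1))
```

Under these assumed values both the right and the left bound come out at `beta` up to floating-point error, in line with the claim that the bounds coincide at $\beta$ for every $(s, z', z)$.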
Example – Model (ii): For an instance of the nonlinear regression model

$$\begin{cases} Y = \beta \sqrt{S} + A \\ S = \pi Z + V \end{cases} \quad \text{with} \quad (A, V) \sim N\left( 0, \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \right),$$

some calculation yields $Q_{S|Z}(\theta \mid z) = \pi z + Q_V(\theta)$ and $E[Y \mid S = s, Z = z] = \beta \sqrt{s} + s - \pi z$. Substituting these expressions in the bound formula shows

$$B(s; z', z) = \beta\, \frac{\sqrt{\pi(z' - z) + s} - \sqrt{s}}{\pi(z' - z)} \quad \text{for all } (s, z', z).$$

Therefore, taking $(s; z', z) = (\pi; 2, 1)$ for example yields

$$\underbrace{B(\pi; 2, 1)}_{= \frac{\beta}{\sqrt{\pi}}(\sqrt{2} - 1)} < \underbrace{APE(\pi, 1)}_{= \frac{\beta}{\sqrt{\pi}} \cdot \frac{1}{2}} < \underbrace{B\left(\pi; \tfrac{1}{2}, 1\right)}_{= \frac{\beta}{\sqrt{\pi}}(2 - \sqrt{2})} \quad \text{if } \beta > 0.$$

As we have a strict inequality between upper and lower bounds, we reject $H_0^S$. On the other hand, since the set enclosed by the intersection bounds containing the $z$-invariant APE is non-empty, we fail to reject $H_0^A$. $\square$

Example – Model (iii): For an instance of the linear random coefficient model

$$\begin{cases} Y = (\beta A) S + A \\ S = \pi Z + V \end{cases} \quad \text{with} \quad (A, V) \sim N\left( 0, \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \right),$$

some calculation yields $Q_{S|Z}(\theta \mid z) = \pi z + Q_V(\theta)$ and $E[Y \mid S = s, Z = z] = (\beta s + 1)(s - \pi z)$. Substituting these expressions in the bound formula shows

$$B(s; z', z) = \beta \cdot (s - \pi z) \quad \text{for all } (s, z', z).$$

On the other hand, the true APE takes the following form:

$$APE(s, z) = \beta E[A \mid S = s, Z = z] = \beta E[A \mid V = s - \pi z] = \beta \cdot (s - \pi z) \quad \text{for all } (s, z).$$

Therefore, the bounds $B(s; z', z)$ point-identify the heterogeneous $APE(s, z)$ for every $(s, z)$. As the upper and lower bounds coincide at $\beta(s - \pi z)$, we fail to reject $H_0^S$. On the other hand, given the empty intersection bounds $\bigcap_{z \in \mathcal{Z}} [B(s; z^+_{s,z}, z), B(s; z^-_{s,z}, z)] = \bigcap_{z \in \mathcal{Z}} \{\beta \cdot (s - \pi z)\} = \emptyset$ under $\pi \neq 0$, we reject $H_0^A$. $\square$

These three examples reconfirm the relevance of the empirically testable hypotheses for our main goal of specification testing. Indeed, we correctly fail to reject $H_0^S$ under models (i) and (iii). Similarly, we correctly fail to reject $H_0^A$ under models (i) and (ii).

2.2. Numerical Illustrations. This subsection graphically illustrates the implications of Theorem 5 for specification tests.
Consider the following family of data-generating models as a numerical example:

$$Y = \phi(S, A) := \frac{5^{p_s}}{1 - p_s} (S - 5)^{1 - p_s} e^{p_a A} + (1 - p_a) A,$$

where $p_s$ and $p_a$ are parameters that generate heterogeneity across $S$ and $A$, respectively. These parameters reduce the model into the four types in the following manner:

$p_s = 0$, $p_a = 0$ $\Longrightarrow$ (i) Linear constant coefficient model
$p_s \neq 0$, $p_a = 0$ $\Longrightarrow$ (ii) Nonlinear separable model
$p_s = 0$, $p_a \neq 0$ $\Longrightarrow$ (iii) Linear random coefficient model
$p_s \neq 0$, $p_a \neq 0$ $\Longrightarrow$ (iv) Nonlinear nonseparable model

In order to keep the true APE analytically tractable, we assume the linear first stage $S = 10 + Z + V$, where $Z$ is uniform on $\{-1, 0, 1\}$ and $(A, V)$ follows a joint normal distribution with positive covariance.

Figure 3.1 depicts the true $APE(10, z)$ together with its bounds $B(10; z + 1, z)$ and $B(10; z - 1, z)$. Notice that the bound inequalities are strict under the nonlinear models (ii) and (iv), whereas the upper and lower bounds exactly coincide under the linear models (i) and (iii). This result agrees with the (in-)equalities in Theorem 5, and these characteristics can be used in an attempt to reject $H_0^S$, or as a way to distinguish types (ii) and (iv) from types (i) and (iii). Also notice that the true $APE(10, z)$ is non-constant in $z$ under the heterogeneous models (iii) and (iv), whereas the true $APE(10, z)$ is constant across $z$ under the homogeneous models (i) and (ii). The non-constancy of $APE(10, z)$ across $z$ under the heterogeneous models (iii) and (iv) causes empty intersection bounds across $z$, and this characteristic can be used in an attempt to reject $H_0^A$, or as a way to distinguish types (iii) and (iv) from types (i) and (ii).

3. The Test Statistics

The previous section analyzed the role that bounds play in distinguishing the four types of the models under discrete instruments.
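As a companion to the data-generating family of Section 2.2, the sketch below codes the structural function and its partial effect in $s$, and shows how $p_s$ and $p_a$ switch the $s$-dependence and $a$-dependence on and off. It assumes the leading factor reads $5^{p_s}/(1 - p_s)$ and requires $s > 5$ and $p_s \neq 1$; the function names are hypothetical and nothing here reproduces the simulations behind Figure 3.1.

```python
import math


def phi(s, a, ps, pa):
    """Structural function of Section 2.2 (assumed reading: 5**ps / (1 - ps) up front).

    Requires s > 5 and ps != 1.
    """
    return (5.0 ** ps / (1.0 - ps)) * (s - 5.0) ** (1.0 - ps) * math.exp(pa * a) \
        + (1.0 - pa) * a


def dphi_ds(s, a, ps, pa):
    """Analytic partial effect of phi with respect to s."""
    return 5.0 ** ps * (s - 5.0) ** (-ps) * math.exp(pa * a)


def model_type(ps, pa):
    """Map (ps, pa) to the four model types listed in the text."""
    if ps == 0 and pa == 0:
        return "(i) linear constant coefficient"
    if pa == 0:
        return "(ii) nonlinear separable"
    if ps == 0:
        return "(iii) linear random coefficient"
    return "(iv) nonlinear nonseparable"


if __name__ == "__main__":
    for ps, pa in [(0.0, 0.0), (0.5, 0.0), (0.0, 0.1), (0.5, 0.1)]:
        effects = [round(dphi_ds(s, a, ps, pa), 4) for s in (8.0, 12.0) for a in (-1.0, 1.0)]
        print(model_type(ps, pa), effects)
```

With $p_s = 0$ the printed partial effects do not change with $s$, and with $p_a = 0$ they do not change with $a$, which is the invariance pattern the hypotheses $H_0^S$ and $H_0^A$ are meant to capture.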
A sample analog of the bound in Theorem 5 is

$$\hat{B}(s; z', z) = \frac{E_{n,h_n}[Y \mid S = \hat{q}(s, z; z'), Z = z'] - E_{n,h_n}[Y \mid S = s, Z = z]}{\hat{q}(s, z; z') - s},$$

where $\hat{q}(s, z; z')$ is an $r_n$-consistent estimator of $Q_{S|Z}\big(F_{S|Z}(s \mid z) \mid z'\big)$, and $E_{n,h_n}$ denotes the Nadaraya-Watson estimator

$$E_{n,h_n}[Y \mid S = s, Z = z] := \left( \sum_{i=1}^{n} K\left( \frac{S_i - s}{h_n} \right) \mathbb{1}(Z_i = z)\, Y_i \right) \Big/ \left( \sum_{i=1}^{n} K\left( \frac{S_i - s}{h_n} \right) \mathbb{1}(Z_i = z) \right).$$

Because the econometric and statistical literature has suggested numerous estimators of conditional distribution and quantile regressions, we will not elaborate on large sample properties of $\hat{q}(s, z; z')$. It suffices to assume $r_n = \sqrt{n}$, which is in general true when $Z$ is discrete and is compatible with the following assumption.

Assumption 13 (Large Sample). (i) $\hat{q}(s, z; z') - Q_{S|Z}\big(F_{S|Z}(s \mid z) \mid z'\big) = O_p(r_n^{-1})$. (ii) $h_n \to 0$, $r_n \to \infty$, $h_n^3 r_n^2 \to \infty$, $n h_n \to \infty$, and $n h_n r_n^{-2} \to 0$ as $n \to \infty$. (iii) $K$ is symmetric and Lipschitz-continuous with $\int K = 1$, $\mathrm{supp}(K) \subset (-1, 1)$, and $\|K\|_{2+\delta} < \infty$ for some $\delta > 0$. (iv) $(A_i, V_i, Z_i)$ is i.i.d. (v) With $\varepsilon := Y - E[Y \mid S, Z]$, the skedastic function $\sigma^2(s, z) := E[\varepsilon^2 \mid S = s, Z = z]$ is twice continuously differentiable with bounded second derivatives, and $E[|\varepsilon|^{2+\delta} \mid S = s, Z = z] < \infty$ for some $\delta > 0$. (vi) $f_{SZ}$ is twice continuously differentiable with respect to $s$ with bounded second derivatives. (vii) $E[Y \mid S = s, Z = z]$ is twice continuously differentiable with respect to $s$ with a bounded second derivative.

[Figure 3.1. Relationships between the true $APE(10, z)$ and their bounds. Panels: (i) Linear Constant Coefficient ($p_s = 0.0$, $p_a = 0.0$); (iii) Linear Random Coefficient ($p_s = 0.0$, $p_a = 0.1$); (ii) Nonlinear Separable ($p_s = 0.5$, $p_a = 0.0$); (iv) Nonlinear Nonseparable ($p_s = 0.5$, $p_a = 0.1$); (ii) Nonlinear Separable ($p_s = -0.5$, $p_a = 0.0$); (iv) Nonlinear Nonseparable ($p_s = -0.5$, $p_a = -0.1$).]

Under this set of assumptions, asymptotic distributions of the bound estimators $\hat{B}(s; z', z)$ are obtained in Lemmas 22–24 in the appendix section.
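The sample bound above can be sketched in a few lines of code. This is an illustrative implementation under stated assumptions, not the estimator used for the reported results: the Epanechnikov kernel is one convenient compactly supported choice (whether it meets every condition of Assumption 13 (iii) is not verified here), and `q_hat_empirical` is a hypothetical plug-in for $\hat{q}(s, z; z')$ built from empirical ranks, since the text leaves that estimator unspecified.

```python
def epanechnikov(u):
    """Symmetric kernel with compact support, integrating to one."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def nw_mean(S, Z, Y, s, z, h, kernel=epanechnikov):
    """Nadaraya-Watson estimate of E[Y | S = s, Z = z] using only rows with Z_i == z."""
    num = den = 0.0
    for s_i, z_i, y_i in zip(S, Z, Y):
        if z_i == z:
            w = kernel((s_i - s) / h)
            num += w * y_i
            den += w
    if den == 0.0:
        raise ValueError("no observations with Z == z within bandwidth h of s")
    return num / den


def q_hat_empirical(S, Z, s, z, z_prime):
    """Hypothetical plug-in for q_hat(s, z; z'): match the empirical rank of s in the
    Z == z subsample to the same rank in the Z == z' subsample."""
    s_z = [s_i for s_i, z_i in zip(S, Z) if z_i == z]
    s_zp = sorted(s_i for s_i, z_i in zip(S, Z) if z_i == z_prime)
    count = sum(1 for s_i in s_z if s_i <= s)
    rank = (count * len(s_zp) + len(s_z) - 1) // len(s_z)  # integer ceiling
    return s_zp[min(len(s_zp) - 1, max(0, rank - 1))]


def bound_hat(S, Z, Y, s, z, z_prime, q_hat, h):
    """Sample analog of B(s; z', z) given a counterfactual treatment level q_hat."""
    return (nw_mean(S, Z, Y, q_hat, z_prime, h) - nw_mean(S, Z, Y, s, z, h)) / (q_hat - s)
```

A typical call would first compute `q = q_hat_empirical(S, Z, s, z, z_prime)` and then `bound_hat(S, Z, Y, s, z, z_prime, q, h)`; the bandwidth `h` is left to the user, subject to the rate conditions in Assumption 13 (ii).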
These asymptotic distributions will be used in turn to derive the asymptotic behavior of the test statistics to be presented in Sections 3.1 and 3.2.

Recall the two practical limitations discussed in the introductory section, discrete instruments and local instrumental effects. We fix short-hand notations in this light. Because we focus on discrete instrumental variation, let $\mathrm{supp}(Z) = \{z_1, \cdots, z_K\}$ with $z_1 < \cdots < z_K$. Moreover, because we focus on local instrumental effects, let $s \in \mathrm{supp}(S)$ be fixed hereafter.[4]

[4] In case of non-local instrumental effects, one can extend our test statistics by summing / integrating over $s$ to gain power. Summing over a finite set of $s$ is a straightforward extension of our results. A drawback with integrating over continuous $s$ is that the stochastic process based on our nonparametric estimation does not converge weakly to a tight process (hence non-Donsker). One way to overcome this difficulty is to directly obtain the extreme value distribution of the limit process, e.g., Chernozhukov, Lee, and Rosen (2009) Section 3.5.

3.1. Test of the Hypothesis $H_0^S$. For a test of model types (i) and (iii), we consider the null hypothesis

$$H_0^S: \frac{\partial^2}{\partial s^2} \phi(s, a) = 0 \text{ for all } a \in \mathrm{supp}(A),$$

against the list of two alternatives

$$H_{-1}^S: \frac{\partial^2}{\partial s^2} \phi(s, a) < 0 \text{ for some } a \text{ over a non-null set, and}$$

$$H_{+1}^S: \frac{\partial^2}{\partial s^2} \phi(s, a) > 0 \text{ for some } a \text{ over a non-null set.}$$

By Theorem 5, the null hypothesis $H_0^S$ implies the empirically testable hypothesis

$$H_0^{S\prime}: B(s; z_{k+1}, z_k) = B(s; z_{k-1}, z_k) \text{ for each } k = 2, \cdots, K - 1.$$

Contrapositively, rejection of this $H_0^{S\prime}$ concludes rejection of the original $H_0^S$. Figure 3.2 depicts relative orderings of the bounds $B(s; \cdot\,, \cdot\,)$ under each case of these null and alternative hypotheses.

A natural approach is to use a measure of discrepancy between upper and lower bounds to form a test statistic.
As such, consider the simple test statistic

$$\hat{T}^S_{k,n} := \frac{\sqrt{n h_n}\, \big( \hat{B}(s; z_{k+1}, z_k) - \hat{B}(s; z_{k-1}, z_k) \big)}{\sqrt{\Gamma_{k,k-1} + \Gamma_{k,k+1} - 2\Gamma_{k,k-1,k+1}}}$$

for any $k \in \{2, \cdots, K - 1\}$, where a $(2K - 2) \times (2K - 2)$ matrix $\Gamma$ is given in Section 7 in the appendix. Large negative values of $\hat{T}^S_n$ reject $H_0^{S\prime}$ in favor of $H_{-1}^S$. Large positive values of $\hat{T}^S_n$ reject $H_0^{S\prime}$ in favor of $H_{+1}^S$. This test statistic asymptotically follows the standard normal distribution.

Proposition 4 (Test of $H_0^S$). Suppose that Assumptions 12 and 13 hold. Then, $\hat{T}^S_{k,n} \xrightarrow{d} Z \sim N(0, 1)$ under $H_0^S$.

3.2. Test of the Hypothesis $H_0^A$. For a test of model types (i) and (ii), we consider the null hypothesis

$$H_0^A: \frac{\partial}{\partial s} \phi(s, a) = \beta_s \text{ for some constant } \beta_s \text{ for } [P_{A|S=s}]\text{-a.s. } a.$$

A test statistic will be constructed through the following logic. Given discrete $z$, $H_0^A$ implies

$$H_0^{A\prime}: APE(s, z) = \beta_s \text{ for some constant } \beta_s \text{ for all } z \in \mathrm{supp}(Z \mid S = s).$$

Suppose that $\phi$ exhibits (DR) or (CR), i.e., (IR) is ruled out. Theorem 5 under $H_0^{A\prime}$ yields $B(s; z_{k+1}, z_k) \le \beta_s$ for all $k = 1, \cdots, K - 1$ and $\beta_s \le B(s; z_{k-1}, z_k)$ for all $k = 2, \cdots, K$. Satisfaction of these inequalities implies the following empirically testable hypothesis:

$$H_0^{A\prime\prime}: B(s; z_{k+1}, z_k) \le B(s; z_{k'-1}, z_{k'}) \text{ for all } k, k'.$$

[Figure 3.2. Graphical characterizations of specification tests. Columns: fail to reject $H_0^S$; reject $H_0^S$ in favor of $H_{-1}^S$; reject $H_0^S$ in favor of $H_{+1}^S$. Top row: fail to reject $H_0^A$. Bottom row: reject $H_0^A$.]

Rejection of $H_0^{A\prime\prime}$ implies rejection of $H_0^{A\prime}$, which in turn implies rejection of $H_0^A$. Figure 3.2 depicts the various cases in which the last null hypothesis is rejected or fails to be rejected.

The last testable form of the hypothesis, $H_0^{A\prime\prime}$, requires that no lower bound exceeds any upper bound.
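The requirement that no lower bound exceeds any upper bound can be expressed as a simple plug-in check. The sketch below is illustrative only: the function names are hypothetical, it works with given bound values under (DR) or (CR), and it involves no standardization, weighting, or critical values, so it is not the test statistic itself.

```python
def max_bound_violation(lower, upper):
    """Largest value of (lower bound - upper bound) over all pairs (k, k').

    lower: values B(s; z_{k+1}, z_k) for k = 1, ..., K-1
    upper: values B(s; z_{k'-1}, z_{k'}) for k' = 2, ..., K
    """
    return max(l - u for l in lower for u in upper)


def intersection_is_empty(lower, upper):
    """True when some lower bound exceeds some upper bound, i.e. the intervals
    [lower bound, upper bound] across z cannot share a common point."""
    return max_bound_violation(lower, upper) > 0.0


if __name__ == "__main__":
    # Hypothetical bound values for K = 3 support points of Z.
    print(intersection_is_empty([0.10, 0.20], [0.30, 0.25]))  # intervals overlap
    print(intersection_is_empty([0.50, 0.40], [0.40, 0.30]))  # APE falling in z
```

The second call mimics a point-identified APE that decreases in $z$, as in Example (iii) of Section 2.1, where the intersection across $z$ is empty.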
Consider a test statistic as a measure of the largest deviation from this requirement

$$\hat{T}^A_n(W) := \sqrt{n h_n}\, \max_{k,k'} \Big\{ W_{k,k'} \big( \hat{B}(s; z_{k+1}, z_k) - \hat{B}(s; z_{k'-1}, z_{k'}) \big) \Big\},$$

where $W$ is a weighting matrix. Consider the random variable $T^A(W) := \max_{k,k'} \{W_{k,k'}(L_k - U_{k'})\}$, where $(L_1, U_2, L_2, U_3, L_3, \cdots, U_K)$ is a $2(K - 1)$-tuple random vector following the normal law $N(0, \Gamma)$ with $\Gamma$ given in Section 7 in the appendix.

Proposition 5 (Test of $H_0^A$). Suppose that Assumptions 12 and 13 hold. Then,

$$\limsup_{n \to \infty}\ \sup_{H \in \mathcal{H}_0^A} \Pr\Big( \hat{T}^A_n > F_{T^A}^{-1}(1 - \alpha) \,\Big|\, H \Big) \le \alpha$$

holds for all $\alpha \in (0, 1)$ under (DR) or (CR).

Remark 15. Under (IR) or (CR), we can develop a similar test statistic by switching the roles of $\hat{B}(s; z_{k+1}, z_k)$ and $\hat{B}(s; z_{k'-1}, z_{k'})$.

The size of this test is in general conservative. The exact size $\alpha$ is achieved under the case of point-identification for every $z \in \mathrm{supp}(Z)$, that is, when Definition 1 (CR) holds. Under-rejection becomes more likely as partially identified regions become wider.

3.3. Monte Carlo Evidences. Consider the family of data-generating models introduced in Section 2.2:

$$Y = \frac{5^{p_s}}{1 - p_s} (S - 5)^{1 - p_s} e^{p_a A} + (1 - p_a) A.$$

Recall that $p_s$ and $p_a$ are parameters that induce heterogeneity in the dimensions of $S$ and $A$, respectively. The null hypothesis $H_0^S$ is true when $p_s = 0$. Similarly, the null hypothesis $H_0^A$ is true when $p_a = 0$. We expect that the power of the test of $H_0^S$ increases as $p_s$ deviates away from zero, and that the power of the test of $H_0^A$ increases as $p_a$ deviates away from zero.

Figure 3.3 draws MC-simulated power curves of the 95% level tests across different values of $p_s$ and $p_a$ for various sample sizes of 1,000, 2,000, 5,000, and 10,000 observations. The size is indeed correct under both tests with about 5% rejection probability at $p_s = 0$ and $p_a = 0$ for the tests of $H_0^S$ and $H_0^A$, respectively.
The power approaches one as the sample size increases when $p_s \neq 0$ and $p_a \neq 0$ under the tests of $H_0^S$ and $H_0^A$, respectively. These simulation results evidence the power as well as the unbiasedness of the proposed tests. The next section further evidences their power with empirical data.

[Figure 3.3. Simulated power curves of the specification tests. Panels: 95% Level Test of $H_0^S$; 95% Level Test of $H_0^A$.]

4. Testing with Empirical Data

In this section, we apply the proposed methods to two economic problems, both of which have been extensively studied in the empirical literature. The first application analyzes wage returns to years of schooling (Section 4.1). We will reject $H_0^A$, hence precluding (i) the linear constant coefficient model and (ii) the nonlinear separable model. The second application analyzes infant birth weight as an outcome of smoking intensity (Section 4.2). We will reject $H_0^S$, hence precluding (i) the linear constant coefficient model and (iii) the linear random coefficient model.

4.1. Returns to Schooling. Empirical assessment of the marginal returns to schooling has long been of interest in labor economics. An important development in the literature was the emergence of natural experiments that supply instrumental variables (IV) as sources of exogenous variations in endogenous choice, e.g., Angrist and Krueger (1991). Interpretations of the IV estimator have been discussed in the econometric literature, e.g., Angrist and Imbens (1995); (1997). For instance, (1997) shows that 2SLS identifies the average partial effect under a special case of structural type (iii). In this way, the knowledge of the true model type may allow a sensible structural interpretation of the common statistical parameters.

We attempt to test the structural types (i), (ii), and/or (iii) using the data of Angrist and Krueger (1991). They used the quarter of birth as an exogenous source of variation in years of schooling in order to study partial effects of education on wage outcomes.
The data consist of three decades of birth cohorts with log wage outcome ($Y$), endogenous choice of years of schooling ($S$), and several attribute characteristics including quarter of birth ($Z$) and state of birth as well as other standard covariates.

Before starting the tests, we note that the instrumental effects should be limited to the locality of the 9th to 10th years of schooling, where compulsory education laws are supposed to induce difference by quarter of birth, although there are some variations across time and states. Lleras-Muney (2002) describes the details of this policy. Recall that this practical limitation was the reason for our setting of the null hypotheses $H_0^S$ and $H_0^A$ at fixed $s$ instead of global $s$. It is necessary to focus on this locality where the instrument indeed matters, as characterized by the testable restriction

$$\text{Local First-Stage Effects} := Q_{S|Z}(\theta_z(s) \mid z') - Q_{S|Z}(\theta_z(s) \mid z) \neq 0.$$

In order to confirm this restriction of Assumption 12 (SI), we estimate the heterogeneous first-stage effects across $(s, z)$ using smooth quantile regression estimation[5] for the subsample of individuals born in Arkansas, Kentucky, or Tennessee, the three states associated with strongest first-stage effects (Hoogerheide et al., 2007). Figure 3.4 shows that the first-stage effects in these three states are nonzero at $s = 9$, whereas the instrument may be irrelevant at the college level ($s = 12$ and $s = 16$). These differential first-stage effects coincide with the pattern implied by compulsory education policies. We drop the first quarter because the transition between the first and second quarters induces no difference even at $s = 9$. Therefore, we use the instrumental variations among $\mathrm{supp}(Z) = \{2, 3, 4\}$.

[5] This follows from Horowitz (1998) replacing the indicator function with a smooth function in the objective function of quantile estimators. We do so for our treatment of the years $S$ as a continuous variable.

Figure 3.4.
Heterogeneous first-stage effects across years of schooling and quarters of birth. Sample: place of birth from Arkansas, Kentucky, or Tennessee for all birth years.

The set estimates and confidence intervals for $APE(9.5, z)$ for $z \in \{2, 3, 4\}$ are depicted in Figure 3.5. Comparing Figure 3.5 with Figure 3.1, we can conjecture that the closest model in a statistical sense is (iii) the linear random coefficient model. Moreover, comparing Figure 3.5 with Figure 3.2, we can see that the bottom left picture in Figure 3.2 best resembles Figure 3.5 in a statistical sense. According to Figure 3.5, the upper and lower bounds of $APE(9.5, 3)$ do not seem to significantly differ from each other, hence we are not likely to reject $H_0^S$. On the other hand, the figure shows that $APE(9.5, z)$ tends to decrease in $z$, which implies that we will probably reject $H_0^A$.

Let us formalize these visual analyses as follows. First, consider the hypothesis $H_0^S: \frac{\partial^2}{\partial s^2} \phi(9.5, a) = 0$ for all $a \in \mathrm{supp}(A)$. Recall that large negative (respectively, positive) values of the test statistic $\hat{T}^S_n$ reject the null hypothesis $H_0^S$ of constant returns in favor of the alternative hypothesis $H_{-1}^S$ of decreasing returns (respectively, $H_{+1}^S$ of increasing returns). Table 3.1 shows that the test statistic is negative, but not significantly so. Hence, we fail to reject $H_0^S$.

Figure 3.5. Bounds and confidence regions of $APE(9.5, z)$ for $z = 2, 3, 4$. Sample: place of birth in Arkansas, Kentucky, or Tennessee.

Table 3.1. Results of specification tests for Angrist & Krueger (1991) data.

Null Hypothesis | Alternative Hypothesis | Test Statistic | p-value
$H_0^S$: $\frac{\partial \phi}{\partial s}$ is constant across $S$ | $H_{-1}^S$: $\phi$ is concave in $S$; $H_{+1}^S$: $\phi$ is convex in $S$ | $\hat{T}^S_n = -0.90$ | 0.367
$H_0^A$: $\frac{\partial \phi}{\partial s}$ is constant across $A$ | $H_1^A$: $\frac{\partial \phi}{\partial s}$ varies with $A$ | $\hat{T}^A_n = 106.38$ | 0.018**

Second, consider a test of the hypothesis $H_0^A: \frac{\partial}{\partial s} \phi(9.5, a) = \beta_s$ for some constant $\beta_s$ for all $a \in \mathrm{supp}(A)$.
Table 3.1 shows that the test statistic is significantly large at the 5% level. We therefore reject $H_0^A$.

These results imply that the wage production function may be linear in years of schooling. However, the constant returns to education are likely to be heterogeneous, conceivably across unobserved abilities. The structural specifications of (i) the constant coefficient model and (ii) the nonlinear separable model are rejected, whereas (iii) the linear random coefficient model survived our attempt at rejection.

4.2. Smoking and Infant Birth Weights.

The effects of smoking by pregnant women on infant birth weights have been studied extensively in both the health economics literature (e.g., Rosenzweig and Schultz, 1983; Evans and Ringel, 1999; and Lien and Evans, 2005) and the medical literature (e.g., Lightwood, Phibbs, and Glantz, 1999). In this section, we analyze the structural type of the birth-weight production function that takes cigarettes as a negative production factor. Cigarette excise tax rates are used to instrument for variations in cigarette consumption, following the approach of Evans and Ringel (1999).

From natality data of the National Vital Statistics System of the National Center for Health Statistics, we extract a random sample of 100,000 observations from 1989 to 1999. Since three categories of instrumental variations suffice for our purpose, we categorize the tax rate into three groups, high for 50–100%, medium for 25–50%, and low for 0–25%, which are labeled as $z = 1$, 2, and 3, respectively.

Cigarette excise tax rates have nearly continuous variations. This instrumental variable, unlike the example of Section 4.1 or many other empirical data sets, is rich enough to allow nonparametric inferences without partial identification. We nevertheless consider this application for the sake of demonstrating the power of our test for $H_0^S$.
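The three-way categorization of the tax instrument described above can be sketched as follows. This is a minimal illustration in Python, not the code used for the analysis: the function name `tax_category`, the convention that the rate is supplied in percent, and the treatment of the boundary values 25 and 50 (assigned here to the higher-tax group) are all assumptions, since the text does not specify them.

```python
def tax_category(rate_percent: float) -> int:
    """Map a cigarette excise tax rate (in percent) to the instrument label z.

    Labels follow the text: z = 1 (high, 50-100%), z = 2 (medium, 25-50%),
    z = 3 (low, 0-25%). Assigning the boundary values 25 and 50 to the
    higher-tax group is an assumption made for this sketch.
    """
    if not 0.0 <= rate_percent <= 100.0:
        raise ValueError("tax rate must lie in [0, 100] percent")
    if rate_percent >= 50.0:
        return 1  # high
    if rate_percent >= 25.0:
        return 2  # medium
    return 3      # low


# Hypothetical rates, for illustration only (not taken from the data).
labels = [tax_category(r) for r in (10.0, 30.0, 75.0)]
```

A different boundary convention would only require changing the two comparisons; the resulting discrete instrument has three support points either way.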
In order to confirm the restriction of Assumption 12 (SI), we estimate the heterogeneous first-stage effects across $(s, z)$ using smooth quantile regression estimation. Figure 3.6 suggests that the first-stage effects are nonzero for $2 \le s \le 8$. Both positive and negative variations are large around $s = 5$. Therefore, we focus on a neighborhood of $s = 5$.

The set estimates and confidence intervals for $APE(5, z)$ are depicted in Figure 3.7. We can conjecture the true structural type by comparing Figure 3.7 with Figure 3.1 or 3.2. In comparison with Figure 3.1, we see that the graph representing (ii) the nonlinear separable model most closely resembles Figure 3.7. The upper and lower bounds of $APE(5, 2)$ seem to differ significantly from each other, hence we will probably reject $H_0^S$. On the other hand, the figure does not imply a tendency of either increase or decrease for the true $APE(5, z)$ in $z$, thus we are not likely to reject $H_0^A$.

Figure 3.6. Heterogeneous first-stage effects across number of cigarettes smoked and cigarette tax rate.

Figure 3.7. Bounds and confidence regions of $APE(5, z)$ for $z = 1, 2, 3$ with $h = 3$.

Let us formalize these visual analyses. First, consider a test of the hypothesis

\[ H_0^S : \frac{\partial^2}{\partial s^2}\phi(5, a) = 0 \quad \text{for all } a \in \mathrm{supp}(A). \]
Table 3.2 shows that the test statistic is significantly positive. Hence, we reject $H_0^S$ in favor of $H_{+1}^S$, that is, convexity. Second, consider a test of the hypothesis

\[ H_0^A : \frac{\partial}{\partial s}\phi(5, a) = \beta_s \quad \text{for some constant } \beta_s \text{ for all } a \in \mathrm{supp}(A). \]

Table 3.2 shows that the test statistic is not significant. We therefore fail to reject $H_0^A$.

Table 3.2. Results of specification tests for smoking and infant birth weights.

s = 4
Null Hypothesis                                              Alternative Hypothesis                                      Test Statistic               p-value
$H_0^S$: $\partial\phi/\partial s$ is constant across $S$    $H_{-1}^S$: $\phi$ is concave in $S$                        $\widehat{T}_n^S = 5.86$     0.000***
                                                             $H_{+1}^S$: $\phi$ is convex in $S$
$H_0^A$: $\partial\phi/\partial s$ is constant across $A$    $H_1^A$: $\partial\phi/\partial s$ varies with $A$          $\widehat{T}_n^A = 0.00$     0.625

s = 5
Null Hypothesis                                              Alternative Hypothesis                                      Test Statistic               p-value
$H_0^S$: $\partial\phi/\partial s$ is constant across $S$    $H_{-1}^S$: $\phi$ is concave in $S$                        $\widehat{T}_n^S = 4.07$     0.000***
                                                             $H_{+1}^S$: $\phi$ is convex in $S$
$H_0^A$: $\partial\phi/\partial s$ is constant across $A$    $H_1^A$: $\partial\phi/\partial s$ varies with $A$          $\widehat{T}_n^A = 0.00$     0.618

s = 6
Null Hypothesis                                              Alternative Hypothesis                                      Test Statistic               p-value
$H_0^S$: $\partial\phi/\partial s$ is constant across $S$    $H_{-1}^S$: $\phi$ is concave in $S$                        $\widehat{T}_n^S = 2.06$     0.040**
                                                             $H_{+1}^S$: $\phi$ is convex in $S$
$H_0^A$: $\partial\phi/\partial s$ is constant across $A$    $H_1^A$: $\partial\phi/\partial s$ varies with $A$          $\widehat{T}_n^A = 0.00$     0.612

These results imply that the birth-weight production function is convex in the number of cigarettes, and the shapes of these functions are perhaps homogeneous across individuals. In other words, the negative marginal effects of smoking diminish in the number of cigarettes, but may not vary across unobserved physiological characteristics of mothers. The structural specifications of (i) the constant coefficient model and (iii) the linear random coefficient model are rejected, whereas (ii) the nonlinear separable model survived our attempt at rejection.

5. Conclusion

This paper proposed methods of specification testing that are effective even when instruments exhibit only discrete variations, as is the case with many empirical data sets.
To reflect this limitation, we developed an idea to use partially identified parameters to construct test statistics. The tests are designed to distinguish (i) the linear constant coefficient model, (ii) the nonlinear separable model, and (iii) the linear random coefficient model, against the alternative of (iv) the nonlinear nonseparable model. We showed the empirical relevance of the method. Specifications (i) and (ii) are rejected for log wages as a function of years of education. Specifications (i) and (iii) are rejected for infant birth weights as a function of smoking intensity.

6. Appendix: Well-Defined Conditional Expectations

Define $\Psi := \frac{\partial}{\partial s}\phi(S, A)$. Assume that there exists a function $h \in L^1(F_{SZ})$ such that

\[ \int_F h(s, z)\, dF_{SZ}(s, z) = \int_{\mathbb{R} \times F} \psi \, dF_{\Psi S Z}(\psi, s, z) \]

holds for every Borel set $F \in \mathcal{B}(\mathbb{R}^2)$. Under this assumption, we can define the conditional expectation

\[ E\left[ \frac{\partial}{\partial s}\phi(S, A) \,\middle|\, (S, Z) = \cdot \right] := h, \]

which does not suffer from the Borel-Kolmogorov paradox. We similarly assume that other conditional expectations and conditional distributions used throughout this paper are well-defined.

7. Appendix: Covariance Matrices $\Gamma$ and $\tilde{\Gamma}$

The covariance matrix $\Gamma$ is given by

\[ \Gamma = \begin{pmatrix}
\Gamma_{12} & & & & & & O \\
 & \Gamma_{21} & \Gamma_{213} & & & & \\
 & \Gamma_{213} & \Gamma_{23} & & & & \\
 & & & \Gamma_{32} & \Gamma_{324} & & \\
 & & & \Gamma_{324} & \Gamma_{34} & & \\
 & & & & & \ddots & \\
O & & & & & & \Gamma_{K,K-1}
\end{pmatrix} \]

where

\[ \Gamma_{k,k'} := \|K\|_2^2 \frac{\sigma^2\big(Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_{k'}), z_{k'}\big)}{f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_{k'}), z_{k'}\big)\big[Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_{k'}) - s\big]^2} + \|K\|_2^2 \frac{\sigma^2(s, z_k)}{f_{SZ}(s, z_k)\big[Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_{k'}) - s\big]^2} \]

\[ \Gamma_{k,k',k''} := \|K\|_2^2 \frac{\sigma^2(s, z_k)}{f_{SZ}(s, z_k)\big[Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_{k''}) - s\big]\big[Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_{k'}) - s\big]}. \]

The covariance matrix $\tilde{\Gamma}$ is the middle $2(K-2) \times 2(K-2)$ elements of $\Gamma$.

8. Appendix: Auxiliary Lemmas

8.1. Lemma 18.

Lemma 18. Suppose that Assumption 12 holds. Then

\[ f_{V|SZ}\big(v \mid Q_{S|Z}(\theta \mid z'), z'\big) = f_{V|SZ}\big(v \mid Q_{S|Z}(\theta \mid z), z\big) \]

holds for all $\theta \in (0, 1)$, for all $v \in \mathrm{supp}(V)$, and for all $z, z' \in \mathrm{supp}(Z)$.

Proof.
With this specification, we have

\[ F_{S|Z}(\psi(z, v) \mid z) = \Pr(\psi(z, V) \le \psi(z, v) \mid Z = z) \overset{(1)}{=} \Pr(\psi(z, V) \le \psi(z, v) \mid Z = z') \overset{(2)}{=} \Pr(\psi(z', V) \le \psi(z', v) \mid Z = z') = F_{S|Z}(\psi(z', v) \mid z') \]

for all $(a, v) \in \mathrm{supp}(A, V)$ and $z, z' \in \mathrm{supp}(Z)$, where step (1) is due to (IV), and step (2) is due to (IM). But then, the random variable defined by $\Theta := F_{S|Z}(\psi(z, V) \mid z)$ does not depend on $z$, hence $\Theta = h(V)$ for some function $h$. Since $(V, \Theta) = (V, h(V))$, (IV) implies $(V, \Theta) \perp\!\!\!\perp Z$. But then,

\[ f_{VZ|\Theta} = \frac{f_{V\Theta Z}}{f_\Theta} = \frac{f_{V\Theta} f_Z}{f_\Theta} = f_{V|\Theta} f_Z = f_{V|\Theta} f_{Z|\Theta}, \]

showing that $V \perp\!\!\!\perp Z \mid \Theta$.

By (AC), $F_{S|Z}(\,\cdot \mid z)$ is invertible on its support, and the map $\theta \mapsto F_{S|Z}^{-1}(\theta \mid z)$ is injective. But then, so is the map $(\theta, z) \mapsto \big(F_{S|Z}^{-1}(\theta \mid z), z\big)$. This implies that $f_{V|SZ}\big(v \mid F_{S|Z}^{-1}(\theta \mid z), z\big) = f_{V|\Theta Z}(v \mid \theta, z)$ holds for all $v$, $\theta$, and $z$. Now apply $V \perp\!\!\!\perp Z \mid \Theta$ to this equality to get $f_{V|SZ}\big(v \mid F_{S|Z}^{-1}(\theta \mid z), z\big) = f_{V|\Theta Z}(v \mid \theta, z) = f_{V|\Theta}(v \mid \theta)$. Therefore, it follows that

\[ f_{V|SZ}\big(v \mid F_{S|Z}^{-1}(\theta \mid z'), z'\big) - f_{V|SZ}\big(v \mid F_{S|Z}^{-1}(\theta \mid z), z\big) = f_{V|\Theta}(v \mid \theta) - f_{V|\Theta}(v \mid \theta) = 0. \qquad \square \]

8.2. Lemma 19.

Lemma 19. Suppose that (IV) holds. Then,

\[ E[Y \mid S = s, Z = z] = \int \phi(s, a) f_{A|V}(a \mid v) f_{V|SZ}(v \mid s, z)\, d(a, v) \]

holds for all $(a, s, v, z)$ for which the conditional distributions are well-defined.

Proof. First, note that

\[ f_{A|VSZ}(a \mid v, s, z) = f_{A|VSZ}(a \mid v, \psi(z, v), z) = f_{A|VZ}(a \mid v, z) \]

holds on the support of $f_{V|SZ}$. Using this fact, we obtain

\[ E[Y \mid S = s, Z = z] = \int\!\!\int \phi(s, a) f_{A|VZ}(a \mid v, z) f_{V|SZ}(v \mid s, z)\, da\, dv. \]

Next, using (IV) reduces $f_{A|VZ}$ into $f_{A|V}$, hence proving the lemma. $\square$

8.3. Lemma 20.

Lemma 20. Suppose that Assumption 12 holds. If $z, z' \in \mathrm{supp}(Z)$ and $\theta \in (0, 1)$, then

\[ E[\phi(s', A) - \phi(s, A) \mid S = s, Z = z] = E[Y \mid S = s', Z = z'] - E[Y \mid S = s, Z = z] \]

and

\[ E[\phi(s', A) - \phi(s, A) \mid S = s', Z = z'] = E[Y \mid S = s', Z = z'] - E[Y \mid S = s, Z = z] \]

hold, where $s = Q_{S|Z}(\theta \mid z)$ and $s' = Q_{S|Z}(\theta \mid z')$.

Proof.
First, note that the convex support condition in (AC) guarantees the invertibility of $F_{S|Z}(\,\cdot \mid z)$, hence $F_{S|Z}^{-1}(\,\cdot \mid z)$ is well-defined on $(0, 1)$. For notational simplicity, write

\[ \Lambda(z', z, \theta) := \int\!\!\int \phi\big(F_{S|Z}^{-1}(\theta \mid z'), a\big) f_{A|V}(a \mid v) f_{V|SZ}\big(v \mid F_{S|Z}^{-1}(\theta \mid z), z\big)\, da\, dv. \]

Then, Lemma 19 states

\[ E[Y \mid S = F_{S|Z}^{-1}(\theta \mid z'), Z = z'] = \Lambda(z', z', \theta) \quad \text{and} \quad E[Y \mid S = F_{S|Z}^{-1}(\theta \mid z), Z = z] = \Lambda(z, z, \theta). \]

On the other hand, Lemma 18 implies

\[ \Lambda(z', z', \theta) - \Lambda(z', z, \theta) = 0 \quad \text{and} \quad \Lambda(z, z', \theta) - \Lambda(z, z, \theta) = 0. \]

Combining these two results yields

\[ E[Y \mid S = F_{S|Z}^{-1}(\theta \mid z'), Z = z'] - E[Y \mid S = F_{S|Z}^{-1}(\theta \mid z), Z = z] = \underbrace{\Lambda(z', z', \theta) - \Lambda(z', z, \theta)}_{0} + \underbrace{\Lambda(z', z, \theta) - \Lambda(z, z, \theta)}_{(*)} \]

where the $(*)$ part is

\[ (*) = \int\!\!\int \Big[\phi\big(F_{S|Z}^{-1}(\theta \mid z'), a\big) - \phi\big(F_{S|Z}^{-1}(\theta \mid z), a\big)\Big] f_{A|V}(a \mid v) f_{V|SZ}\big(v \mid F_{S|Z}^{-1}(\theta \mid z), z\big)\, da\, dv \]
\[ \phantom{(*)} = E\Big[\phi\big(F_{S|Z}^{-1}(\theta \mid z'), A\big) - \phi\big(F_{S|Z}^{-1}(\theta \mid z), A\big) \,\Big|\, S = F_{S|Z}^{-1}(\theta \mid z), Z = z\Big]. \]

Now, substitute $s' = F_{S|Z}^{-1}(\theta \mid z')$ and $s = F_{S|Z}^{-1}(\theta \mid z)$ to obtain

\[ E[Y \mid S = s', Z = z'] - E[Y \mid S = s, Z = z] = E[\phi(s', A) - \phi(s, A) \mid S = s, Z = z]. \]

Similarly, we write

\[ E[Y \mid S = F_{S|Z}^{-1}(\theta \mid z'), Z = z'] - E[Y \mid S = F_{S|Z}^{-1}(\theta \mid z), Z = z] = \underbrace{\Lambda(z', z', \theta) - \Lambda(z, z', \theta)}_{(**)} + \underbrace{\Lambda(z, z', \theta) - \Lambda(z, z, \theta)}_{0} \]

where the $(**)$ part is

\[ (**) = \int\!\!\int \Big[\phi\big(F_{S|Z}^{-1}(\theta \mid z'), a\big) - \phi\big(F_{S|Z}^{-1}(\theta \mid z), a\big)\Big] f_{A|V}(a \mid v) f_{V|SZ}\big(v \mid F_{S|Z}^{-1}(\theta \mid z'), z'\big)\, da\, dv \]
\[ \phantom{(**)} = E\Big[\phi\big(F_{S|Z}^{-1}(\theta \mid z'), A\big) - \phi\big(F_{S|Z}^{-1}(\theta \mid z), A\big) \,\Big|\, S = F_{S|Z}^{-1}(\theta \mid z'), Z = z'\Big]. \]

Now, substitute $s' = F_{S|Z}^{-1}(\theta \mid z')$ and $s = F_{S|Z}^{-1}(\theta \mid z)$ to obtain

\[ E[Y \mid S = s', Z = z'] - E[Y \mid S = s, Z = z] = E[\phi(s', A) - \phi(s, A) \mid S = s', Z = z']. \qquad \square \]

8.4. Lemma 21.

Lemma 21 (Monotonicity). (i) Suppose that (IV) and (SI) hold. Then, for a fixed $\theta \in (0, 1)$, $Q_{S|Z}(\theta \mid z)$ is increasing in $z$. (ii) If in addition (AC) holds, then, for a fixed $\theta \in (0, 1)$, $Q_{S|Z}(\theta \mid z)$ is strictly increasing in $z$.

Proof. Let $z' > z$.
Since $\psi$ is increasing by (SI), $\Pr(\psi(z, V) \le s) \ge \Pr(\psi(z', V) \le s)$ for all $s$. But since $\Pr(\psi(z, V) \le s) = \Pr(\psi(Z, V) \le s \mid Z = z) = F_{S|Z}(s \mid z)$ and similarly $\Pr(\psi(z', V) \le s) = F_{S|Z}(s \mid z')$ by (IV), the above inequality reduces to $F_{S|Z}(s \mid z) \ge F_{S|Z}(s \mid z')$ for all $s$. This inequality in turn yields the set relation

\[ \{ s \mid F_{S|Z}(s \mid z) \ge \theta \} \supset \{ s \mid F_{S|Z}(s \mid z') \ge \theta \}. \]

But then,

\[ F_{S|Z}^{-1}(\theta \mid z) = \inf\{ s \mid F_{S|Z}(s \mid z) \ge \theta \} \le \inf\{ s \mid F_{S|Z}(s \mid z') \ge \theta \} = F_{S|Z}^{-1}(\theta \mid z'), \]

which proves part (i).

To prove part (ii), assume (AC) in addition. Note that (AC) implies the absolute continuity of the distribution of $\psi(z, V)$ with a convex support for each $z \in \mathrm{supp}(Z)$. But then, $z < z'$ and (SI) yield $\Pr(\psi(z, V) \le s) > \Pr(\psi(z', V) \le s)$, hence $\theta := F_{S|Z}(s \mid z) > F_{S|Z}(s \mid z')$ by the same argument as in part (i). Since (AC) implies $\frac{\partial}{\partial\theta} F_{S|Z}^{-1}(\theta \mid z) > 0$, we obtain

\[ F_{S|Z}^{-1}(\theta \mid z') = F_{S|Z}^{-1}\big(F_{S|Z}(F_{S|Z}^{-1}(\theta \mid z') \mid z) \mid z\big) > F_{S|Z}^{-1}\big(F_{S|Z}(F_{S|Z}^{-1}(\theta \mid z') \mid z') \mid z\big) = F_{S|Z}^{-1}(\theta \mid z), \]

which proves part (ii). $\square$

8.5. Lemma 22.

Lemma 22 (Asymptotic Distribution). Suppose that Assumption 13 holds. If $z' \neq z$ and $Q_{S|Z}(F_{S|Z}(s \mid z) \mid z') \neq s$, then

\[ \sqrt{n h_n}\big(\widehat{B}(s; z', z) - B(s; z', z)\big) \xrightarrow{d} \xi \sim N(0, V), \]

where

\[ V := \|K\|_2^2 \, \frac{f_{SZ}(s, z)\,\sigma^2\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), z'\big) + f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), z'\big)\,\sigma^2(s, z)}{f_{SZ}(s, z)\, f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), z'\big)\big[Q_{S|Z}(F_{S|Z}(s \mid z) \mid z') - s\big]^2}. \]

Proof. By Assumption 13 (i), $\widehat{q}(s, z; z') \xrightarrow{p} Q_{S|Z}(F_{S|Z}(s \mid z) \mid z')$ is granted. Therefore, it suffices to show convergence of the numerator of $\widehat{B}(s; z', z)$ to that of $B(s; z', z)$ in law. I use some variants of the standard asymptotic theories of kernel-based estimators (e.g., Pagan and Ullah, 1999).
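As a purely illustrative aside, the kernel-based objects that the proof manipulates can be sketched in Python. This is a minimal sketch under stated assumptions, not the estimator as implemented for the empirical sections: the form of $\widehat{B}(s; z', z)$ as a ratio of the difference in kernel regression estimates to $\widehat{q}(s, z; z') - s$ is inferred from the decomposition used in this proof, and the Epanechnikov kernel, the empirical-quantile rule in `q_hat`, the simulated design, and every function name are hypothetical choices made for the example.

```python
import math
import random


def epanechnikov(u: float) -> float:
    """Kernel vanishing outside (-1, 1); an assumed choice of K."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def kernel_mean(data, s, z, h):
    """Nadaraya-Watson estimate of E[Y | S = s, Z = z] from (y, s, z) triples."""
    num = den = 0.0
    for y_i, s_i, z_i in data:
        if z_i == z:
            w = epanechnikov((s_i - s) / h)
            num += w * y_i
            den += w
    if den == 0.0:
        raise ValueError("no observations within the bandwidth")
    return num / den


def q_hat(data, s, z, z_prime):
    """Empirical analogue of Q_{S|Z}(F_{S|Z}(s | z) | z')."""
    s_z = [s_i for _, s_i, z_i in data if z_i == z]
    s_zp = sorted(s_i for _, s_i, z_i in data if z_i == z_prime)
    theta = sum(s_i <= s for s_i in s_z) / len(s_z)  # empirical CDF at s
    k = min(len(s_zp) - 1, max(0, math.ceil(theta * len(s_zp)) - 1))
    return s_zp[k]  # empirical theta-quantile of S given Z = z'


def b_hat(data, s, z_prime, z, h):
    """Difference quotient B-hat(s; z', z) built from the pieces above."""
    q = q_hat(data, s, z, z_prime)
    return (kernel_mean(data, q, z_prime, h) - kernel_mean(data, s, z, h)) / (q - s)


# Simulated data with a known constant slope of 2 (illustration only).
rng = random.Random(0)
sample = []
for _ in range(20000):
    z_i = rng.randint(0, 1)
    s_i = rng.random() + 0.5 * z_i  # the instrument shifts S upward
    y_i = 2.0 * s_i + rng.gauss(0.0, 0.1)
    sample.append((y_i, s_i, z_i))

estimate = b_hat(sample, s=0.5, z_prime=1, z=0, h=0.1)
```

In this simulated design the outcome is linear in $S$ with slope 2 for every individual, so the difference quotient should be close to 2; the denominator stays away from zero because the instrument shifts the conditional quantile of $S$, which mirrors the condition $Q_{S|Z}(F_{S|Z}(s \mid z) \mid z') \neq s$ in the lemma.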
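First, the Lyapunov CLT with Assumption 13 (ii)–(vi) yields

\[ \sqrt{n h_n}\big(E_{n,h}[Y \mid S = s, Z = z] - E[Y \mid S = s, Z = z]\big) = \frac{\frac{1}{\sqrt{n h_n}} \sum_{i=1}^n K\!\left(\frac{S_i - s}{h_n}\right) 1(Z_i = z)\, \varepsilon_i}{\frac{1}{n h_n} \sum_{i=1}^n K\!\left(\frac{S_i - s}{h_n}\right) 1(Z_i = z)} \]
\[ = \frac{\frac{1}{\sqrt{n(z) h_n}} \sum_{i=1}^n K\!\left(\frac{S_i - s}{h_n}\right) 1(Z_i = z)\, \varepsilon_i}{\sqrt{\frac{n(z)}{n}}\, \frac{1}{n(z) h_n} \sum_{i=1}^n K\!\left(\frac{S_i - s}{h_n}\right) 1(Z_i = z)} \xrightarrow{d} \frac{\xi_1}{\sqrt{f_Z(z)}\, f_{S|Z}(s \mid z)} \]

where $n(z) := \sum_{i=1}^n 1(Z_i = z)$ and $\xi_1 \sim N\big(0, f_{S|Z}(s \mid z) \|K\|_2^2 \sigma^2(s, z)\big)$. Similarly, we have

\[ \sqrt{n h_n}\Big(E_{n,h}[Y \mid S = \widehat{q}(s, z; z'), Z = z'] - E[Y \mid S = Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), Z = z']\Big) \]
\[ = \frac{\frac{1}{\sqrt{n h_n}} \sum_{i=1}^n K\!\left(\frac{S_i - \widehat{q}(s, z; z')}{h_n}\right) 1(Z_i = z')\, \varepsilon_i}{\frac{1}{n h_n} \sum_{i=1}^n K\!\left(\frac{S_i - \widehat{q}(s, z; z')}{h_n}\right) 1(Z_i = z')} + \underbrace{\sqrt{n h_n}\Big(E[Y \mid S = \widehat{q}(s, z; z'), Z = z'] - E[Y \mid S = Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), Z = z']\Big)}_{= o_p(1)}. \]

First, we show that the last term is $o_p(1)$. To see this, note that by the mean value expansion together with Assumption 13 (i), (ii), and (vii), we have

\[ \sqrt{n h_n}\Big(E[Y \mid S = \widehat{q}(s, z; z'), Z = z'] - E[Y \mid S = Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), Z = z']\Big) = n^{1/2} h_n^{1/2} O_p(r_n^{-1}) = O_p\big((n h_n r_n^{-2})^{1/2}\big) = o_p(1). \]

To study the asymptotic behavior of

\[ \frac{1}{\sqrt{n h_n}} \sum_{i=1}^n K\!\left(\frac{S_i - \widehat{q}(s, z; z')}{h_n}\right) 1(Z_i = z')\, \varepsilon_i, \]

rewrite it as

\[ \underbrace{\frac{1}{\sqrt{n h_n}} \sum_{i=1}^n K\!\left(\frac{S_i - Q_{S|Z}(F_{S|Z}(s \mid z) \mid z')}{h_n}\right) 1(Z_i = z')\, \varepsilon_i}_{\xrightarrow{d}\ \sqrt{f_Z(z')}\, \xi_2} + \underbrace{\frac{1}{\sqrt{n h_n}} \sum_{i=1}^n \left[ K\!\left(\frac{S_i - \widehat{q}(s, z; z')}{h_n}\right) - K\!\left(\frac{S_i - Q_{S|Z}(F_{S|Z}(s \mid z) \mid z')}{h_n}\right) \right] 1(Z_i = z')\, \varepsilon_i}_{o_p(1)}. \]

Again, the Lyapunov CLT with Assumption 13 (ii)–(vi) yields convergence of the first term as

\[ \frac{1}{\sqrt{n h_n}} \sum_{i=1}^n K\!\left(\frac{S_i - Q_{S|Z}(F_{S|Z}(s \mid z) \mid z')}{h_n}\right) 1(Z_i = z')\, \varepsilon_i = \sqrt{\frac{n(z')}{n}}\, \frac{1}{\sqrt{n(z') h_n}} \sum_{i=1}^n K\!\left(\frac{S_i - Q_{S|Z}(F_{S|Z}(s \mid z) \mid z')}{h_n}\right) 1(Z_i = z')\, \varepsilon_i \xrightarrow{d} \sqrt{f_Z(z')}\, \xi_2 \]

where $\xi_2 \sim N\Big(0, f_{S|Z}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z') \mid z'\big) \|K\|_2^2\, \sigma^2\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), z'\big)\Big)$.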
On the other hand,

\[ \frac{1}{\sqrt{n h_n}} \sum_{i=1}^n \left[ K\!\left(\frac{S_i - \widehat{q}(s, z; z')}{h_n}\right) - K\!\left(\frac{S_i - Q_{S|Z}(F_{S|Z}(s \mid z) \mid z')}{h_n}\right) \right] 1(Z_i = z')\, \varepsilon_i \]

is $o_p(1)$. To see this, let $M$ denote the Lipschitz constant of $K$ as granted by Assumption 13 (iii), and note that

\[ \left| K\!\left(\frac{S_i - \widehat{q}(s, z; z')}{h_n}\right) - K\!\left(\frac{S_i - Q_{S|Z}(F_{S|Z}(s \mid z) \mid z')}{h_n}\right) \right| \le M\, \frac{\big|\widehat{q}(s, z; z') - Q_{S|Z}(F_{S|Z}(s \mid z) \mid z')\big|}{h_n} = O_p(h_n^{-1} r_n^{-1}) \]

by Assumption 13 (i). Note that this expression is independent of the subscript $i$. Another application of the Lyapunov CLT with Assumption 13 (ii)–(vi) yields

\[ \frac{1}{\sqrt{n}} \sum_{i=1}^n 1(Z_i = z')\, \varepsilon_i = \sqrt{\frac{n(z')}{n}}\, \frac{1}{\sqrt{n(z')}} \sum_{i=1}^n 1(Z_i = z')\, \varepsilon_i = O_p(1). \]

Hence, it follows that

\[ \frac{1}{\sqrt{n h_n}} \sum_{i=1}^n \left[ K\!\left(\frac{S_i - \widehat{q}(s, z; z')}{h_n}\right) - K\!\left(\frac{S_i - Q_{S|Z}(F_{S|Z}(s \mid z) \mid z')}{h_n}\right) \right] 1(Z_i = z')\, \varepsilon_i = h_n^{-1/2}\, O_p(h_n^{-1} r_n^{-1})\, O_p(1) = O_p\big((h_n^3 r_n^2)^{-1/2}\big) = o_p(1) \]

by Assumption 13 (ii), as desired. A similar decomposition method will show

\[ \frac{1}{n h_n} \sum_{i=1}^n K\!\left(\frac{S_i - \widehat{q}(s, z; z')}{h_n}\right) 1(Z_i = z') = \frac{n(z')}{n}\, \frac{1}{n(z') h_n} \sum_{i=1}^n K\!\left(\frac{S_i - \widehat{q}(s, z; z')}{h_n}\right) 1(Z_i = z') \xrightarrow{p} f_Z(z')\, f_{S|Z}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z') \mid z'\big) = f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), z'\big). \]

Putting all these pieces together, we obtain

\[ \sqrt{n h_n}\Big(E_{n,h}[Y \mid S = \widehat{q}(s, z; z'), Z = z'] - E[Y \mid S = Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), Z = z']\Big) \xrightarrow{d} \frac{\sqrt{f_Z(z')}\, \xi_2}{f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), z'\big)} \]

where $\xi_2 \sim N\Big(0, f_{S|Z}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z') \mid z'\big) \|K\|_2^2\, \sigma^2\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), z'\big)\Big)$.

Since $z \neq z'$, $Q_{S|Z}(F_{S|Z}(s \mid z) \mid z') \neq s$, $h_n \to 0$, and $\mathrm{supp}(K) \subset (-1, 1)$ by assumption,

\[ \sqrt{n h_n}\big(E_{n,h}[Y \mid S = s, Z = z] - E[Y \mid S = s, Z = z]\big) \]

and

\[ \sqrt{n h_n}\Big(E_{n,h}[Y \mid S = \widehat{q}(s, z; z'), Z = z'] - E[Y \mid S = Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), Z = z']\Big) \]

are asymptotically independent.
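Using this fact, we see that

\[ \widehat{\xi} := \sqrt{n h_n}\Big( \big(E_{n,h}[Y \mid S = \widehat{q}(s, z; z'), Z = z'] - E_{n,h}[Y \mid S = s, Z = z]\big) - \big(E[Y \mid S = Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), Z = z'] - E[Y \mid S = s, Z = z]\big) \Big) \]
\[ \xrightarrow{d} \xi := \frac{\sqrt{f_Z(z')}\, \xi_2}{f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), z'\big)} - \frac{\sqrt{f_Z(z)}\, \xi_1}{f_{SZ}(s, z)} \]

where

\[ \xi \sim N\left(0,\ \frac{\|K\|_2^2\, \sigma^2\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), z'\big)}{f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), z'\big)} + \frac{\|K\|_2^2\, \sigma^2(s, z)}{f_{SZ}(s, z)} \right). \]

Lastly, noting Assumption 13 (i) and (ii) yields

\[ \sqrt{n h_n}\big(\widehat{B}(s; z', z) - B(s; z', z)\big) = \frac{\widehat{\xi}}{\widehat{q}(s, z; z') - s} + \frac{\sqrt{n h_n}\big(\widehat{q}(s, z; z') - Q_{S|Z}(F_{S|Z}(s \mid z) \mid z')\big)}{\big(\widehat{q}(s, z; z') - s\big)\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z') - s\big)} \]
\[ = \frac{\widehat{\xi}}{\widehat{q}(s, z; z') - s} + \frac{O_p\big((n h_n r_n^{-2})^{1/2}\big)}{\big(\widehat{q}(s, z; z') - s\big)\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z') - s\big)} \xrightarrow{d} \frac{\xi}{Q_{S|Z}(F_{S|Z}(s \mid z) \mid z') - s} \sim N(0, V) \]

where

\[ V = \frac{\dfrac{\|K\|_2^2\, \sigma^2\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), z'\big)}{f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z'), z'\big)} + \dfrac{\|K\|_2^2\, \sigma^2(s, z)}{f_{SZ}(s, z)}}{\big[Q_{S|Z}(F_{S|Z}(s \mid z) \mid z') - s\big]^2}. \qquad \square \]

8.6. Lemma 23.

Lemma 23 (Asymptotic Joint Distribution). Suppose that Assumption 13 holds. If $z_* < z < z^*$ and $Q_{S|Z}(F_{S|Z}(s \mid z) \mid z_*) < s < Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^*)$, then

\[ \sqrt{n h_n} \begin{pmatrix} \widehat{B}(s; z^*, z) - B(s; z^*, z) \\ \widehat{B}(s; z_*, z) - B(s; z_*, z) \end{pmatrix} \xrightarrow{d} N\left(0, \begin{pmatrix} \Sigma_{11}(s, z) & \Sigma_{12}(s, z) \\ \Sigma_{12}(s, z) & \Sigma_{22}(s, z) \end{pmatrix}\right) \]

where

\[ \Sigma_{11}(s, z) := \|K\|_2^2\, \frac{f_{SZ}(s, z)\, \sigma^2\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^*), z^*\big) + f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^*), z^*\big)\, \sigma^2(s, z)}{f_{SZ}(s, z)\, f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^*), z^*\big)\big[Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^*) - s\big]^2} \]
\[ \Sigma_{12}(s, z) := \|K\|_2^2\, \frac{\sigma^2(s, z)}{f_{SZ}(s, z)\big[Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^*) - s\big]\big[Q_{S|Z}(F_{S|Z}(s \mid z) \mid z_*) - s\big]} \]
\[ \Sigma_{22}(s, z) := \|K\|_2^2\, \frac{f_{SZ}(s, z)\, \sigma^2\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z_*), z_*\big) + f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z_*), z_*\big)\, \sigma^2(s, z)}{f_{SZ}(s, z)\, f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z_*), z_*\big)\big[Q_{S|Z}(F_{S|Z}(s \mid z) \mid z_*) - s\big]^2}. \]

Proof.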
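By using a similar argument to the proof of Lemma 22 (i), we obtain

\[ \sqrt{n h_n} \begin{pmatrix} \widehat{B}(s; z^*, z) - B(s; z^*, z) \\ \widehat{B}(s; z_*, z) - B(s; z_*, z) \end{pmatrix} = \widehat{Q}(s, z; z^*, z_*)^{-1} \begin{pmatrix} \widehat{\xi}^* \\ \widehat{\xi}_* \end{pmatrix} + O_p\big((n h_n r_n^{-2})^{1/2}\big) \]

where

\[ \widehat{Q}(s, z; z^*, z_*) := \mathrm{diag}\big(\widehat{q}(s, z; z^*) - s,\ \widehat{q}(s, z; z_*) - s\big) \xrightarrow{p} \mathrm{diag}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^*) - s,\ Q_{S|Z}(F_{S|Z}(s \mid z) \mid z_*) - s\big) \]

is nonsingular with probability approaching one under the assumption that $Q_{S|Z}(F_{S|Z}(s \mid z) \mid z_*) < s < Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^*)$, and

\[ \widehat{\xi}^* := \sqrt{n h_n}\Big( \big(E_{n,h}[Y \mid S = \widehat{q}(s, z; z^*), Z = z^*] - E_{n,h}[Y \mid S = s, Z = z]\big) - \big(E[Y \mid S = Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^*), Z = z^*] - E[Y \mid S = s, Z = z]\big) \Big) \]
\[ \widehat{\xi}_* := \sqrt{n h_n}\Big( \big(E_{n,h}[Y \mid S = \widehat{q}(s, z; z_*), Z = z_*] - E_{n,h}[Y \mid S = s, Z = z]\big) - \big(E[Y \mid S = Q_{S|Z}(F_{S|Z}(s \mid z) \mid z_*), Z = z_*] - E[Y \mid S = s, Z = z]\big) \Big). \]

Since $z \neq z'$, $Q_{S|Z}(F_{S|Z}(s \mid z) \mid z') \neq s$, $h_n \to 0$, and $\mathrm{supp}(K) \subset (-1, 1)$ by assumption,

\[ \sqrt{n h_n}\big(E_{n,h}[Y \mid S = s, Z = z] - E[Y \mid S = s, Z = z]\big), \]
\[ \sqrt{n h_n}\Big(E_{n,h}[Y \mid S = \widehat{q}(s, z; z^*), Z = z^*] - E[Y \mid S = Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^*), Z = z^*]\Big), \text{ and} \]
\[ \sqrt{n h_n}\Big(E_{n,h}[Y \mid S = \widehat{q}(s, z; z_*), Z = z_*] - E[Y \mid S = Q_{S|Z}(F_{S|Z}(s \mid z) \mid z_*), Z = z_*]\Big) \]

are asymptotically independent. Therefore, by a similar argument to the proof of Lemma 22 (i), the Lyapunov CLT together with Assumption 13 (ii)–(vi) yields

\[ \begin{pmatrix} \widehat{\xi}^* \\ \widehat{\xi}_* \end{pmatrix} \xrightarrow{d} \begin{pmatrix} \xi^* \\ \xi_* \end{pmatrix} \sim N\left(0, \begin{pmatrix} \Lambda_{111}(s, z) & \Lambda_{112}(s, z) \\ \Lambda_{112}(s, z) & \Lambda_{122}(s, z) \end{pmatrix}\right) \]

where

\[ \Lambda_{111}(s, z) := \frac{\|K\|_2^2\, \sigma^2\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^*), z^*\big)}{f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z^*), z^*\big)} + \frac{\|K\|_2^2\, \sigma^2(s, z)}{f_{SZ}(s, z)} \]
\[ \Lambda_{112}(s, z) := \frac{\|K\|_2^2\, \sigma^2(s, z)}{f_{SZ}(s, z)} \]
\[ \Lambda_{122}(s, z) := \frac{\|K\|_2^2\, \sigma^2\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z_*), z_*\big)}{f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z) \mid z_*), z_*\big)} + \frac{\|K\|_2^2\, \sigma^2(s, z)}{f_{SZ}(s, z)}. \]

Therefore,

\[ \sqrt{n h_n} \begin{pmatrix} \widehat{B}(s; z^*, z) - B(s; z^*, z) \\ \widehat{B}(s; z_*, z) - B(s; z_*, z) \end{pmatrix} \xrightarrow{d} \Big(\mathrm{plim}\ \widehat{Q}(s, z; z^*, z_*)\Big)^{-1} \begin{pmatrix} \xi^* \\ \xi_* \end{pmatrix} \sim N\left(0, \begin{pmatrix} \Sigma_{11}(s, z) & \Sigma_{12}(s, z) \\ \Sigma_{12}(s, z) & \Sigma_{22}(s, z) \end{pmatrix}\right). \qquad \square \]

8.7. Lemma 24.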
Lemma 24 (Asymptotic Joint Distribution). Suppose that Assumption 13 holds for $s$ and all $z \in \mathrm{supp}(Z)$, and assume that $Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_i) \neq Q_{S|Z}(F_{S|Z}(s \mid z_{k'}) \mid z_j)$ for $k \neq k'$, $k \neq i$, and $k' \neq j$. If $\mathrm{supp}(Z) = \{z_1, \ldots, z_K\}$ with $z_1 < \cdots < z_K$ and $s \in \mathrm{supp}(S)$, then

\[ \sqrt{n h_n} \cdot \mathrm{vec} \begin{pmatrix} \widehat{B}(s; z_2, z_1) - B(s; z_2, z_1) & \cdots & \widehat{B}(s; z_1, z_K) - B(s; z_1, z_K) \\ \vdots & \ddots & \vdots \\ \widehat{B}(s; z_K, z_1) - B(s; z_K, z_1) & \cdots & \widehat{B}(s; z_{K-1}, z_K) - B(s; z_{K-1}, z_K) \end{pmatrix}_{(K-1) \times (K-1)} \xrightarrow{d} N\left(0, \begin{pmatrix} \Sigma(z_1) & \cdots & O \\ \vdots & \ddots & \vdots \\ O & \cdots & \Sigma(z_K) \end{pmatrix}_{(K-1)^2 \times (K-1)^2}\right) \]

where for each $k = 1, \cdots, K$, the $(i, i)$-element of $\Sigma(z_k)$ is

\[ \|K\|_2^2\, \frac{f_{SZ}(s, z_k)\, \sigma^2\big(Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_i), z_i\big) + f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_i), z_i\big)\, \sigma^2(s, z_k)}{f_{SZ}(s, z_k)\, f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_i), z_i\big)\big[Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_i) - s\big]^2} \]

for $i = 1, \cdots, k - 1$, the $(i, i)$-element of $\Sigma(z_k)$ is

\[ \|K\|_2^2\, \frac{f_{SZ}(s, z_k)\, \sigma^2\big(Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_{i+1}), z_{i+1}\big) + f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_{i+1}), z_{i+1}\big)\, \sigma^2(s, z_k)}{f_{SZ}(s, z_k)\, f_{SZ}\big(Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_{i+1}), z_{i+1}\big)\big[Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_{i+1}) - s\big]^2} \]

for $i = k + 1, \cdots, K - 1$, the $(i, j)$-element of $\Sigma(z_k)$ is

\[ \|K\|_2^2\, \frac{\sigma^2(s, z_k)}{f_{SZ}(s, z_k)\big[Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_i) - s\big]\big[Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_j) - s\big]} \]

for $i \neq j$ with $i, j < k$, and similarly for other cases.

Proof. The upper $(K - 1) \times K$ block of the left-hand-side matrix consists of elements of the form

\[ \sqrt{n h_n}\big(\widehat{B}(s; z_i, z_k) - B(s; z_i, z_k)\big) = \frac{\widehat{\xi}_{ik}}{\widehat{q}(s, z_k; z_i) - s} + O_p\big((n h_n r_n^{-2})^{1/2}\big) \]

where, as in the proof of Lemma 23,

\[ \widehat{q}(s, z_k; z_i) - s \xrightarrow{p} Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_i) - s \neq 0 \]

and

\[ \widehat{\xi}_{ik} := \sqrt{n h_n}\Big( \big(E_{n,h}[Y \mid S = \widehat{q}(s, z_k; z_i), Z = z_i] - E_{n,h}[Y \mid S = s, Z = z_k]\big) - \big(E[Y \mid S = Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_i), Z = z_i] - E[Y \mid S = s, Z = z_k]\big) \Big) \]

for which Assumption 13 (ii)–(vi) facilitated a sufficient condition for the Lyapunov CLT to be invoked.
It remains to derive the elements of the variance-covariance matrix, but this follows from the same argument as the proof of Lemma 23. Lastly, we show that all the elements off the block diagonal are zero. To this end, it suffices to observe asymptotic independence between $\widehat{\xi}_{ik}$ and $\widehat{\xi}_{jk'}$ for $k \neq k'$. But this asymptotic independence clearly holds by noting that the data are i.i.d., $h_n \to 0$ as $n \to \infty$, the assumption of the lemma that $Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_i) \neq Q_{S|Z}(F_{S|Z}(s \mid z_{k'}) \mid z_j)$, and that $k \neq k'$ implies $z_k \neq z_{k'}$, $s \neq Q_{S|Z}(F_{S|Z}(s \mid z_k) \mid z_i)$ for all $i \neq k$, and $s \neq Q_{S|Z}(F_{S|Z}(s \mid z_{k'}) \mid z_j)$ for all $j \neq k'$. $\square$

9. Appendix: Proofs of the Theorem and the Propositions

9.1. Proof of Theorem 5.

Proof. We prove the statement for the case of Assumption 1 (DR). Other cases can be similarly proved by simply replacing inequalities. Let $z' := z_{s,z}^{+}$ and $s' = F_{S|Z}^{-1}\big(F_{S|Z}(s \mid z) \mid z'\big)$. Since $s = F_{S|Z}^{-1}\big(F_{S|Z}(s \mid z) \mid z\big)$, Lemma 21 implies $s' > s$. Then (DR) or (CR) implies

\[ \frac{\phi(s', a) - \phi(s, a)}{s' - s} < \frac{\partial}{\partial S}\phi(s, a) \]

for all $a$. But then,

\[ \frac{E[\phi(s', A) - \phi(s, A) \mid S = s, Z = z]}{s' - s} = \int \frac{\phi(s', a) - \phi(s, a)}{s' - s} f_{A|SZ}(a \mid s, z)\, da < \int \frac{\partial}{\partial S}\phi(s, a) f_{A|SZ}(a \mid s, z)\, da = E\left[\frac{\partial}{\partial S}\phi(s, A) \,\middle|\, S = s, Z = z\right]. \]

Moreover, Lemma 20 yields

\[ E[\phi(s', A) - \phi(s, A) \mid S = s, Z = z] = E[Y \mid S = s', Z = z'] - E[Y \mid S = s, Z = z] \]

since $s = F_{S|Z}^{-1}(\theta \mid z)$ and $s' = F_{S|Z}^{-1}(\theta \mid z')$ where $\theta = F_{S|Z}(s \mid z)$. Substituting this equality into the last inequality yields

\[ \frac{E[Y \mid S = s', Z = z'] - E[Y \mid S = s, Z = z]}{s' - s} < E\left[\frac{\partial}{\partial S}\phi(s, A) \,\middle|\, S = s, Z = z\right]. \]

[...]

\[ \Pr\big(\widehat{T}_n > F_{T^A}^{-1}(1 - \alpha) \mid H\big) \le \Pr\big(T_n^* > F_{T^A}^{-1}(1 - \alpha) \mid H\big) \]

for any $\alpha \in (0, 1)$ under $H \in \mathcal{H}_0^A$. Taking the supremum over $\mathcal{H}_0^A$ yields

\[ \sup_{H \in \mathcal{H}_0^A} \Pr\big(\widehat{T}_n > F_{T^A}^{-1}(1 - \alpha) \mid H\big) \le \sup_{H \in \mathcal{H}_0^A} \Pr\big(T_n^* > F_{T^A}^{-1}(1 - \alpha) \mid H\big). \]

Lastly, it follows from $T_n^* \xrightarrow{d} T$ that

\[ \lim_{n \to \infty} \sup_{H \in \mathcal{H}_0^A} \Pr\big(\widehat{T}_n > F_{T^A}^{-1}(1 - \alpha) \mid H\big) \le \lim_{n \to \infty} \sup_{H \in \mathcal{H}_0^A} \Pr\big(T_n^* > F_{T^A}^{-1}(1 - \alpha) \mid H\big) = \Pr\big(T > F_{T^A}^{-1}(1 - \alpha)\big) = \alpha. \qquad \square \]

Bibliography

Adams, Peter, Michael D.
Hurd, Daniel McFadden, Angela Merrill, and Tiago Ribeiro (2003) “Healthy, Wealthy and Wise? Tests for Direct Causal Paths between Health and Socioeconomic Status,” Journal of Econometrics, Vol. 112 (1), pp. 3–56.
Aguirregabiria, Victor (2010) “Another Look at the Identification of Dynamic Discrete Decision Processes: An Application to Retirement Behavior,” Journal of Business and Economic Statistics, Vol. 28 (2), pp. 201–218.
Aguirregabiria, Victor, and Pedro Mira (2007) “Sequential Estimation of Dynamic Discrete Games,” Econometrica, Vol. 75 (1), pp. 1–53.
Ahn, Seung C. and Peter Schmidt (1995) “Efficient Estimation of Models for Dynamic Panel Data,” Journal of Econometrics, Vol. 68 (1), pp. 5–27.
Ai, Chunrong, and Xiaohong Chen (2003) “Efficient Estimation of Models with Conditional Moment Restrictions Containing Unknown Functions,” Econometrica, Vol. 71 (6), pp. 1795–1843.
Almond, Douglas (2006) “Is the 1918 Influenza Pandemic Over? Long-Term Effects of In Utero Influenza Exposure in the Post-1940 U.S. Population,” Journal of Political Economy, Vol. 114 (4), pp. 672–712.
Almond, Douglas and Bhashkar Mazumder (2005) “The 1918 Influenza Pandemic and Subsequent Health Outcomes: An Analysis of SIPP Data,” American Economic Review, Vol. 95 (2), Papers and Proceedings, pp. 258–262.
Altonji, Joseph G. and Rosa L. Matzkin (2005) “Cross Section and Panel Data Estimators for Nonseparable Models with Endogenous Regressors,” Econometrica, Vol. 73 (4), pp. 1053–1102.
Altonji, Joseph G., Hidehiko Ichimura, and Taisuke Otsu (2011) “Estimating Derivatives in Nonseparable Models with Limited Dependent Variables,” Econometrica, forthcoming.
Anderson, T.W. and Cheng Hsiao (1982) “Formulation and Estimation of Dynamic Models Using Panel Data,” Journal of Econometrics, Vol. 18 (1), pp. 47–82.
Andrews, Donald W.K. (2011) “Examples of L2-Complete and Boundedly-Complete Distributions,” Cowles Foundation Discussion Paper No. 1801.
Angrist, Joshua D. and Guido W.
Imbens (1995) “Two-Stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity,” Journal of the American Statistical Association, Vol. 90, No. 430, pp. 431–442.
Angrist, Joshua D. and Alan B. Krueger (1991) “Does Compulsory School Attendance Affect Schooling and Earnings?” Quarterly Journal of Economics, Vol. 106, No. 4, pp. 979–1014.
Angrist, Joshua D. and Alan B. Krueger (2001) “Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments,” Journal of Economic Perspectives, Vol. 15, No. 4, pp. 69–85.
Arcidiacono, Peter and Robert A. Miller (2011) “CCP Estimation of Dynamic Discrete Choice Models with Unobserved Heterogeneity,” Econometrica, forthcoming.
Arellano, Manuel and Stephen Bond (1991) “Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations,” Review of Economic Studies, Vol. 58 (2), pp. 277–297.
Arellano, Manuel and Stéphane Bonhomme (2009) “Identifying Distributional Characteristics in Random Coefficients Panel Data Models,” Cemmap Working Paper No. 22/09.
Arellano, Manuel and Olympia Bover (1995) “Another Look at the Instrumental Variable Estimation of Error-Components Models,” Journal of Econometrics, Vol. 68 (1), pp. 29–51.
Arellano, Manuel and Jinyong Hahn (2005) “Understanding Bias in Nonlinear Panel Models: Some Recent Developments,” Invited Lecture, Econometric Society World Congress, London, August 2005.
Bai, Jushan (2009) “Panel data models with interactive fixed effects,” Econometrica, Vol. 77 (4), pp. 1229–1279.
Bajari, Patrick, C. Lanier Benkard, and Jonathan Levin (2007) “Estimating Dynamic Models of Imperfect Competition,” Econometrica, Vol. 75 (5), pp. 1331–1370.
Baltagi, Badi H. (1985) “Pooling Cross-Sections with Unequal Time-Series Lengths,” Economics Letters, Vol. 18 (2), pp. 133–136.
Baltagi, Badi H.
and Young-Jae Chang (1994) “Incomplete Panels: A Comparative Study of Alternative Estimators for the Unbalanced One-Way Error Component Regression Model,” Journal of Econometrics, Vol. 62 (2), pp. 67–89.
Belzil, Christian and Jörgen Hansen (2002) “Unobserved Ability and the Return to Schooling,” Econometrica, Vol. 70 (5), pp. 2075–2091.
Bester, Alan C. and Christian Hansen (2009) “A Penalty Function Approach to Bias Reduction in Nonlinear Panel Models with Fixed Effects,” Journal of Business and Economic Statistics, Vol. 27 (2), pp. 131–148.
Bhattacharya, Debopam (2008) “Inference in panel data models under attrition caused by unobservables,” Journal of Econometrics, Vol. 144 (2), pp. 430–446.
Blundell, Richard and Stephen Bond (1998) “Initial Conditions and Moment Restrictions in Dynamic Panel Data Models,” Journal of Econometrics, Vol. 87 (1), pp. 115–143.
Blundell, Richard, Xiaohong Chen, and Dennis Kristensen (2007) “Semi-Nonparametric IV Estimation of Shape-Invariant Engel Curves,” Econometrica, Vol. 75 (6), pp. 1613–1669.
Blundell, Richard, and James L. Powell (2003) “Endogeneity in Nonparametric and Semiparametric Regression Models,” in Advances in Economics and Econometrics: Theory and Applications, Vol. 2, ed. by M. Dewatripont, L.-P. Hansen, and S. J. Turnovsky. Cambridge, U.K.: Cambridge University Press, pp. 312–357.
Blundell, Richard, and James L. Powell (2007) “Censored Regression Quantiles with Endogenous Regressors,” Journal of Econometrics, Vol. 141, No. 1, pp. 65–83.
Bonhomme, Stéphane (2010) “Functional Differencing,” Working Paper, New York University.
Bound, John, Todd Stinebrickner, and Timothy Waidmann (2010) “Health, Economic Resources and the Work Decisions of Older Men,” Journal of Econometrics, Vol. 156, No. 1, pp. 106–129.
Brown, Stephen J., William Goetzmann, Roger G. Ibbotson, and Stephen A. Ross (1992) “Survivorship Bias in Performance Studies,” Review of Financial Studies, Vol. 5 (4), pp. 553–580.
Cameron, Stephen V.
and James J. Heckman (1998) “Life Cycle Schooling and Dynamic Selection Bias: Models and Evidence for Five Cohorts of American Males,” Journal of Political Economy, Vol. 106 (2), pp. 262–333.
Card, David (2001) “Estimating the Return to Schooling: Progress on Some Persistent Econometric Problems,” Econometrica, Vol. 69, No. 5, pp. 1127–1160.
Carrasco, Marine, Jean-Pierre Florens, and Eric Renault (2007) “Linear Inverse Problems in Structural Econometrics Estimation Based on Spectral Decomposition and Regularization,” in J.J. Heckman and E.E. Leamer (ed.) Handbook of Econometrics, Vol. 6, Ch. 77.
Case, Anne, Angela Fertig, and Christina Paxson (2005) “The Lasting Impact of Childhood Health and Circumstance,” Journal of Health Economics, Vol. 24 (2), pp. 365–389.
Chamberlain, Gary (2010) “Binary Response Models for Panel Data: Identification and Information,” Econometrica, Vol. 78 (1), pp. 159–168.
Chen, Xiaohong (2007) “Large Sample Sieve Estimation of Semi-Nonparametric Models,” in J.J. Heckman and E.E. Leamer (ed.) Handbook of Econometrics, Vol. 6, Ch. 76.
Chernozhukov, Victor and Christian Hansen (2005) “An IV Model of Quantile Treatment Effects,” Econometrica, Vol. 73 (1), pp. 245–261.
Chernozhukov, Victor and Christian Hansen (2006) “Instrumental Quantile Regression Inference for Structural and Treatment Effect Models,” Journal of Econometrics, Vol. 132, No. 2, pp. 491–525.
Chernozhukov, Victor, Guido W. Imbens, and Whitney K. Newey (2007) “Instrumental Variable Estimation of Nonseparable Models,” Journal of Econometrics, Vol. 139 (1), pp. 4–14.
Chernozhukov, Victor, Sokbae Lee, and Adam M. Rosen (2009) “Intersection Bounds: Estimation and Inference,” Working Paper.
Chernozhukov, Victor, Iván Fernández-Val, Jinyong Hahn, and Whitney K. Newey (2009) “Identification and Estimation of Marginal Effects in Nonlinear Panel Models,” CeMMAP Working Papers CWP05/09.
Chernozhukov, Victor, Iván Fernández-Val, Jinyong Hahn, and Whitney Newey (2010) “Average and Quantile Effects in Nonseparable Panel Models,” MIT Working Paper.
Chesher, Andrew (2003) “Identification in Nonseparable Models,” Econometrica, Vol. 71 (5), pp. 1405–1441.
Chesher, Andrew (2005) “Nonparametric Identification under Discrete Variation,” Econometrica, Vol. 73 (5), pp. 1525–1550.
Contoyannis, Paul, Andrew M. Jones, and Nigel Rice (2004) “The dynamics of health in the British Household Panel Survey,” Journal of Applied Econometrics, Vol. 19 (4), pp. 473–503.
Crawford, Gregory S. and Matthew Shum (2005) “Uncertainty and Learning in Pharmaceutical Demand,” Econometrica, Vol. 73 (4), pp. 1137–1173.
Cunha, Flavio, and James J. Heckman (2008) “Formulating, Identifying and Estimating the Technology of Cognitive and Noncognitive Skill Formation,” Journal of Human Resources, Vol. 43 (4), pp. 738–782.
Cunha, Flavio, James J. Heckman, and Susanne M. Schennach (2010) “Estimating the Technology of Cognitive and Noncognitive Skill Formation,” Econometrica, Vol. 78 (3), pp. 883–931.
Currie, Janet and Enrico Moretti (2003) “Mother’s Education and the Intergenerational Transmission of Human Capital: Evidence From College Openings,” Quarterly Journal of Economics, Vol. 118 (4), pp. 1495–1532.
Cutler, David, Angus Deaton, and Adriana Lleras-Muney (2006) “The Determinants of Mortality,” Journal of Economic Perspectives, Vol. 20 (3), pp. 97–120.
Darolles, Serge, Yanqin Fan, Jean-Pierre Florens, and Eric Renault (2011) “Nonparametric Instrumental Regression,” Econometrica, Vol. 79 (5), pp. 1541–1565.
Das, Mitali (2004) “Simple Estimators for Nonparametric Panel Data Models with Sample Attrition,” Journal of Econometrics, Vol. 120 (1), pp. 159–180.
Deaton, Angus S. and Christina Paxson (2001) “Mortality, Education, Income, and Inequality among American Cohorts,” in Themes in the Economics of Aging, Wise D.A. eds. University of Chicago Press, pp. 129–170.
D’Haultfoeuille, X. and P. Février (2011) “Identification of Nonseparable Models with Endogeneity and Discrete Instruments,” Working Paper.
Eckstein, Zvi and Kenneth I. Wolpin (1999) “Why Youths Drop out of High School: The Impact of Preferences, Opportunities, and Abilities,” Econometrica, Vol. 67 (6), pp. 1295–1339.
Elbers, Chris and Geert Ridder (1982) “True and Spurious Duration Dependence: The Identifiability of the Proportional Hazard Model,” Review of Economic Studies, Vol. 49 (3), pp. 403–409.
Evans, William N. and Jeanne Ringel (1999) “Can Higher Cigarette Taxes Improve Birth Outcomes?” Journal of Public Economics, Vol. 72 (1), pp. 135–54.
Evdokimov, Kirill (2009) “Identification and Estimation of a Nonparametric Panel Data Model with Unobserved Heterogeneity,” Yale University.
Fan, Jianqing and Irène Gijbels (1996) “Local Polynomial Modelling and Its Applications,” Chapman & Hall.
Florens, Jean-Pierre (2003) “Inverse Problems and Structural Econometrics: The Example of Instrumental Variables,” in Advances in Economics and Econometrics: Theory and Applications, Dewatripont, Vol. 2, ed. by L.-P. Hansen and S. J. Turnovsky. Cambridge, U.K.: Cambridge University Press, pp. 284–311.
Florens, Jean-Pierre, James J. Heckman, Costas Meghir, and Edward J. Vytlacil (2008) “Identification of Treatment Effects Using Control Functions in Models With Continuous, Endogenous Treatment and Heterogeneous Effects,” Econometrica, Vol. 76 (5), pp. 1191–1206.
Folland, Gerald B. (1999) “Real Analysis: Modern Techniques and Their Applications,” Wiley.
French, Eric (2005) “The Effects of Health, Wealth, and Wages on Labour Supply and Retirement Behaviour,” Review of Economic Studies, Vol. 72 (2), pp. 395–427.
French, Eric and John B. Jones (2011) “The Effects of Health Insurance and Self-Insurance on Retirement Behavior,” Econometrica, Vol. 79 (3), pp. 693–732.
Garen, John (1984) “The Returns to Schooling: A Selectivity Bias Approach with a Continuous Choice Variable,” Econometrica, Vol. 52 (5), pp. 1199–1218.
Graham, Bryan S. and James Powell (2008) “Identification and Estimation of ‘Irregular’ Correlated Random Coefficient Models,” NBER Working Paper 14469.
Hahn, Jinyong (1999) “How Informative is the Initial Condition in the Dynamic Panel Model with Fixed Effects?” Journal of Econometrics, Vol. 93 (2), pp. 309–326.
Hall, Peter and Joel L. Horowitz (2005) “Nonparametric Methods for Inference in the Presence of Instrumental Variables,” The Annals of Statistics, Vol. 33 (6), pp. 2904–2929.
Halliday, Timothy J. (2008) “Heterogeneity, State Dependence and Health,” Econometrics Journal, Vol. 11 (3), pp. 499–516.
Hausman, Jerry A. and William E. Taylor (1981) “Panel Data and Unobservable Individual Effects,” Econometrica, Vol. 49 (6), pp. 1377–1398.
Hausman, Jerry A. and David A. Wise (1979) “Attrition Bias in Experimental and Panel Data: The Gary Income Maintenance Experiment,” Econometrica, Vol. 47 (2), pp. 455–473.
Heckman, James J. (1981a) “Heterogeneity and State Dependence,” in Studies in Labor Markets, Rosen S., eds. University of Chicago Press. pp. 91–139.
Heckman, James J. (1981b) “The Incidental Parameters Problem and the Problem of Initial Conditions in Estimating a Discrete Time-Discrete Data Stochastic Process,” in Structural Analysis of Discrete Data with Econometric Applications, Manski C.F., McFadden D., eds. MIT Press. pp. 179–195.
Heckman, James J. (1991) “Identifying the Hand of Past: Distinguishing State Dependence from Heterogeneity,” American Economic Review, Vol. 81 (2), Papers and Proceedings, pp. 75–79.
Heckman, James J. (2000) “Causal Parameters and Policy Analysis in Economics: A Twentieth Century Retrospective,” Quarterly Journal of Economics, Vol. 115, pp. 45–97.
Heckman, James J. (2007) “The Economics, Technology, and Neuroscience of Human Capability Formation,” Proceedings of the National Academy of Sciences, Vol. 104 (33), pp. 13250–13255.
Heckman, James J. and Salvador Navarro (2007) “Dynamic Discrete Choice and Dynamic Treatment Effects,” Journal of Econometrics, Vol. 136 (2), pp. 341–396.
Heckman, James J. and Burton Singer (1984) “A Method for Minimizing the Impact of Distributional Assumptions in Econometric Models for Duration Data,” Econometrica, Vol. 52 (2), pp. 271–320.
Heckman, James J. and Edward Vytlacil (1998) “Instrumental Variables Methods for the Correlated Random Coefficient Model: Estimating the Average Rate of Return to Schooling When the Return is Correlated with Schooling,” Journal of Human Resources, Vol. 33 (4), pp. 974–987.
Heckman, James J. and Edward Vytlacil (1999) “Local Instrumental Variables and Latent Variable Models for Identifying and Bounding Treatment Effects,” Proceedings of the National Academy of Science, USA, Vol. 96, pp. 4730–4734.
Heckman, James J. and Edward Vytlacil (2005) “Structural Equations, Treatment Effects, and Econometric Policy Evaluation,” Econometrica, Vol. 73 (3), pp. 669–738.
Heckman, James J. and Edward J. Vytlacil (2007) “Econometric Evaluations of Social Programs, Part 1: Causal Models, Structural Models and Econometric Policy Evaluation,” in J.J. Heckman and E.E. Leamer (ed.) Handbook of Econometrics, Vol. 6, Ch. 70.
Hellerstein, Judith K. and Guido W. Imbens (1999) “Imposing Moment Restrictions from Auxiliary Data by Weighting,” Review of Economics and Statistics, Vol. 81 (1), pp. 1–14.
Henry, Marc, Yuichi Kitamura, and Bernard Salanié (2010) “Identifying Finite Mixtures in Econometric Models,” Cowles Foundation Discussion Paper No. 1767.
Hirano, Keisuke, Guido W. Imbens, Geert Ridder, and Donald B. Rubin (2001) “Combining Panel Data Sets with Attrition and Refreshment Samples,” Econometrica, Vol. 69 (6), pp. 1645–1659.
Hoderlein, Stefan and Enno Mammen (2007) “Identification of Marginal Effects in Nonseparable Models Without Monotonicity,” Econometrica, Vol. 75 (5), pp. 1513–1518.
Hoderlein, Stefan and Enno Mammen (2009) “Identification and Estimation of Local Average Derivatives in Non-Separable Models without Monotonicity,” Econometrics Journal, Vol. 12 (1), pp. 1–25.
Hoderlein, Stefan and Halbert White (2009) “Nonparametric Identification in Nonseparable Panel Data Models with Generalized Fixed Effects,” CeMMAP Working Papers CWP33/09.
Honoré, Bo E. (1990) “Simple Estimation of a Duration Model with Unobserved Heterogeneity,” Econometrica, Vol. 58 (2), pp. 453–473.
Honoré, Bo E. and Elie Tamer (2006) “Bounds on Parameters in Panel Dynamic Discrete Choice Models,” Econometrica, Vol. 74 (3), pp. 611–629.
Hoogerheide, Lennart, Frank Kleibergen, and Herman K. van Dijk (2007) “Natural Conjugate Priors for the Instrumental Variables Regression Model Applied to the Angrist–Krueger Data,” Journal of Econometrics, Vol. 138 (1), pp. 63–103.
Horowitz, Joel L. (1998) “Bootstrap Methods for Median Regression Models,” Econometrica, Vol. 66 (6), pp. 1327–1351.
Horowitz, Joel L. (1999) “Semiparametric Estimation of a Proportional Hazard Model with Unobserved Heterogeneity,” Econometrica, Vol. 67 (5), pp. 1001–1028.
Horowitz, Joel L. (2011) “Applied Nonparametric Instrumental Variables Estimation,” Econometrica, Vol. 79 (2), pp. 347–394.
Horowitz, Joel L. and Sokbae Lee (2007) “Nonparametric Instrumental Variables Estimation of a Quantile Regression Model,” Econometrica, Vol. 75 (4), pp. 1191–1208.
Horowitz, Joel L. and Sokbae Lee (2009) “Testing a Parametric Quantile-Regression Model with an Endogenous Explanatory Variable Against a Nonparametric Alternative,” Journal of Econometrics, Vol. 152 (2), pp. 141–152.
Hotz, Joseph V. and Robert A. Miller (1993) “Conditional Choice Probabilities and the Estimation of Dynamic Models,” Review of Economic Studies, Vol. 60 (3), pp. 497–529.
Hsiao, Cheng (1975) “Some Estimation Methods for a Random Coefficient Model,” Econometrica, Vol. 43 (2), pp. 305–325.
Hsiao, Cheng (2003) “Analysis of Panel Data,” Cambridge University Press.
Hsiao, Cheng and Hashem M. Pesaran (2004) “Random Coefficient Panel Data Models,” IZA Discussion Papers 1236.
Hu, Yingyao (2008) “Identification and estimation of nonlinear models with misclassification error using instrumental variables: A general solution,” Journal of Econometrics, Vol. 144 (1), pp. 27–61.
Hu, Yingyao and Susanne M. Schennach (2008) “Instrumental Variable Treatment of Nonclassical Measurement Error Models,” Econometrica, Vol. 76 (1), pp. 195–216.
Hu, Yingyao and Matthew Shum (2010) “Nonparametric Identification of Dynamic Models with Unobserved State Variables,” Johns Hopkins University Working Paper #543.
Imbens, Guido W. and Joshua D. Angrist (1994) “Identification and Estimation of Local Average Treatment Effects,” Econometrica, Vol. 62 (2), pp. 467–475.
Imbens, Guido W. (2007) “Nonadditive Models with Endogenous Regressors,” in Advances in Economics and Econometrics: Theory and Applications, R. Blundell, W. Newey, T. Persson, eds. Cambridge University Press.
Imbens, Guido W. and Whitney K. Newey (2009) “Identification and Estimation of Triangular Simultaneous Equations Models Without Additivity,” Econometrica, Vol. 77 (5), pp. 1481–1512.
Jun, Sung Jae (2009) “Local Structural Quantile Effects in a Model with a Nonseparable Control Variable,” Journal of Econometrics, Vol. 151 (1), pp. 82–97.
Jun, Sung Jae, Joris Pinkse, and Haiqing Xu (2011) “Tighter Bounds in Triangular Systems,” Journal of Econometrics, Forthcoming.
Karlstrom, Anders, Marten Palme, and Ingemar Svensson (2004) “A Dynamic Programming Approach to Model the Retirement Behaviour of Blue-Collar Workers in Sweden,” Journal of Applied Econometrics, Vol. 19 (6), pp. 795–807.
Kasahara, Hiroyuki, and Katsumi Shimotsu (2009) “Nonparametric Identification of Finite Mixture Models of Dynamic Discrete Choices,” Econometrica, Vol. 77 (1), pp. 135–175.
Kasy, M. (2011) “Identification in Triangular Systems Using Control Functions,” Econometric Theory, Vol. 27 (3), pp. 1–9.
Khan, Shakeeb, Maria Ponomareva, and Elie Tamer (2011) “Inference on Panel Data Models with Endogenous Censoring,” Duke University Working Paper 11-07.
Kyriazidou, Ekaterini (1997) “Estimation of a Panel Data Sample Selection Model,” Econometrica, Vol. 65 (6), pp. 1335–1364.
Lancaster, Tony (1979) “Econometric Methods for the Duration of Unemployment,” Econometrica, Vol. 47 (4), pp. 939–956.
Lee, Sokbae (2007) “Endogeneity in Quantile Regression Models: A Control Function Approach,” Journal of Econometrics, Vol. 141 (2), pp. 1131–1158.
Lehmann, E.L. (1974) “Nonparametrics: Statistical Methods Based on Ranks,” Holden-Day, San Francisco.
Lewbel, A. (1999) “Consumer Demand Systems and Household Expenditure,” in Pesaran, H. and M. Wickens (Eds.), Handbook of Applied Econometrics, Blackwell Handbooks in Economics.
Lewbel, Arthur (2007) “Estimation of Average Treatment Effects with Misclassification,” Econometrica, Vol. 75 (2), pp. 537–551.
Lien, Diana S. and William N. Evans (2005) “Estimating the Impact of Large Cigarette Tax Hikes: The Case of Maternal Smoking and Infant Birth Weight,” Journal of Human Resources, Vol. 40 (2), pp. 373–392.
Lightwood, James M., Ciaran S. Phibbs, and Stanton A. Glantz (1999) “Short-term Health and Economic Benefits of Smoking Cessation: Low Birth Weight,” Pediatrics, Vol. 104 (6), pp. 1312–1320.
Lleras-Muney, Adriana (2002) “Were Compulsory Attendance and Child Labor Laws Effective? An Analysis from 1915 to 1939,” Journal of Law and Economics, Vol. 45 (2), pp. 401–435.
Lleras-Muney, Adriana (2005) “The Relationship Between Education and Adult Mortality in the United States,” Review of Economic Studies, Vol. 72 (1), pp. 189–221.
Luenberger, David G. (1969) “Optimization by Vector Space Methods,” Wiley-Interscience.
Ma, Lingjie and Roger Koenker (2006) “Quantile Regression Methods for Recursive Structural Equation Models,” Journal of Econometrics, Vol. 134, pp. 471–506.
Maccini, Sharon, and Dean Yang (2009) “Under the Weather: Health, Schooling, and Economic Consequences of Early-Life Rainfall,” American Economic Review, Vol. 99 (3), pp. 1006–1026.
Magnac, Thierry and David Thesmar (2002) “Identifying Dynamic Discrete Decision Process,” Econometrica, Vol. 70 (2), pp. 801–816.
Mahajan, Aprajit (2006) “Identification and Estimation of Regression Models with Misclassification,” Econometrica, Vol. 74 (3), pp. 631–665.
Marschak, Jacob (1953) “Economic Measurements for Policy and Prediction,” in: Hood, W., Koopmans, T. (Eds.), Studies in Econometric Method. Wiley, New York, pp. 1–26.
Matzkin, Rosa L. (2003) “Nonparametric Estimation of Nonadditive Random Functions,” Econometrica, Vol. 71 (5), pp. 1339–1375.
Matzkin, Rosa L. (2007) “Nonparametric Identification,” in J.J. Heckman and E.E. Leamer (ed.) Handbook of Econometrics, Vol. 6, Ch. 73.
Moffitt, Robert, John Fitzgerald, and Peter Gottschalk (1999) “Sample Attrition in Panel Data: The Role of Selection on Observables,” Annales d’Économie et de Statistique, No. 55/56, pp. 129–152.
Newey, Whitney K. and James L. Powell (2003) “Instrumental Variable Estimation of Nonparametric Models,” Econometrica, Vol. 71 (5), pp. 1565–1578.
Pagan, Adrian and Aman Ullah (1999) “Nonparametric Econometrics,” Cambridge University Press.
Pakes, Ariel, Michael Ostrovsky, and Steven Berry (2007) “Simple Estimators for the Parameters of Discrete Dynamic Games (with Entry/Exit Examples),” The RAND Journal of Economics, Vol. 38 (2), pp. 373–399.
Pearl, Judea (2000) “Causality,” Cambridge University Press.
Pesaran, Hashem M. and Ron Smith (1995) “Estimating Long-Run Relationships from Dynamic Heterogeneous Panels,” Journal of Econometrics, Vol. 68 (1), pp. 79–113.
Pesendorfer, Martin, and Philipp Schmidt-Dengler (2008) “Asymptotic Least Squares Estimator for Dynamic Games,” Review of Economic Studies, Vol. 75 (3), pp. 901–928.
Ridder, Geert (1990) “The Non-Parametric Identification of Generalized Accelerated Failure-Time Models,” Review of Economic Studies, Vol. 57 (2), pp. 167–181.
Ridder, Geert (1992) “An Empirical Evaluation of Some Models for Non-Random Attrition in Panel Data,” Structural Change and Economic Dynamics, Vol. 3 (2), pp. 337–355.
Ridder, Geert and Tiemen M. Woutersen (2003) “The Singularity of the Information Matrix of the Mixed Proportional Hazard Model,” Econometrica, Vol. 71 (5), pp. 1579–1589.
Rosenzweig, Mark R. and T. Paul Schultz (1983) “Estimating a Household Production Function: Heterogeneity, the Demand for Health Inputs, and Their Effects on Birth Weight,” Journal of Political Economy, Vol. 91 (5), pp. 723–746.
Ruhm, Christopher J. (2000) “Are Recessions Good for Your Health?” Quarterly Journal of Economics, Vol. 115 (2), pp. 617–650.
Rust, John (1987) “Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher,” Econometrica, Vol. 55 (5), pp. 999–1033.
Rust, John (1994) “Structural Estimation of Markov Decision Processes,” in R. Engle and D. McFadden (ed.) Handbook of Econometrics, Vol. 4, Ch. 51.
Rust, John and Christopher Phelan (1997) “How Social Security and Medicare Affect Retirement Behavior In a World of Incomplete Markets,” Econometrica, Vol. 65 (4), pp. 781–831.
Schennach, Susanne M. (2007) “Instrumental Variable Estimation of Nonlinear Errors-in-Variables Models,” Econometrica, Vol. 75 (1), pp. 201–239.
Schennach, Susanne M., Suyong Song, and Halbert White (2011) “Identification and Estimation of Nonseparable Models with Measurement Errors,” unpublished manuscript.
Schultz, Paul T. (2002) “Wage Gains Associated with Height as a Form of Health Human Capital,” American Economic Review, Vol. 92 (2), pp. 349–353.
Shen, Xiaotong (1997) “On Methods of Sieves and Penalization,” Annals of Statistics, Vol. 25 (6), pp. 2555–2591.
Shiu, Ji-Liang, and Yingyao Hu (2011) “Identification and Estimation of Nonlinear Dynamic Panel Data Models with Unobserved Covariates,” Johns Hopkins University Working Paper #557.
Snyder, Stephen E., and William N. Evans (2006) “The Effect of Income on Mortality: Evidence from the Social Security Notch,” Review of Economics and Statistics, Vol. 88 (3), pp. 482–495.
Sullivan, Daniel, and Till von Wachter (2009a) “Average Earnings and Long-Term Mortality: Evidence from Administrative Data,” American Economic Review, Vol. 99 (2), pp. 133–138.
Sullivan, Daniel, and Till von Wachter (2009b) “Job Displacement and Mortality: An Analysis Using Administrative Data,” Quarterly Journal of Economics, Vol. 124 (3), pp. 1265–1306.
Stock, James H. and David A. Wise (1990) “Pensions, the Option Value of Work, and Retirement,” Econometrica, Vol. 58 (5), pp. 1151–1180.
Torgovitsky, Alexander (2011) “Identification and Estimation of Nonparametric Quantile Regressions with Endogeneity,” Working Paper.
Wooldridge, Jeffrey M. (1997) “On Two Stage Least Squares Estimation of the Average Treatment Effect in a Random Coefficient Model,” Economics Letters, Vol. 56 (2), pp. 129–133.
Wooldridge, Jeffrey M. (2001) “Econometric Analysis of Cross Section and Panel Data,” The MIT Press.
Wooldridge, Jeffrey M. (2002) “Inverse Probability Weighted M-Estimators for Sample Selection, Attrition, and Stratification,” Portuguese Economic Journal, Vol. 1 (2), pp. 117–139.
Wooldridge, Jeffrey M. (2005) “Simple Solutions to the Initial Conditions Problem in Dynamic, Nonlinear Panel Data Models with Unobserved Heterogeneity,” Journal of Applied Econometrics, Vol. 20 (1), pp. 39–54.