WHAT CAN BE LEARNED FROM OBSERVED BEHAVIOR?: ESSAYS ON THE PREDICTIVE ABILITY OF THE UTILITY MAXIMIZATION MODEL AND TESTING FOR SATISFICING

BY MARÍA JOSÉ BOCCARDI CHALELA
B.A., UNIVERSIDAD DE MONTEVIDEO, 2006
M.A., BROWN UNIVERSITY, 2011

A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN THE DEPARTMENT OF ECONOMICS AT BROWN UNIVERSITY

PROVIDENCE, RHODE ISLAND
MAY 2016

© Copyright 2016 by María José Boccardi Chalela

This dissertation by María José Boccardi Chalela is accepted in its present form by the Department of Economics as satisfying the dissertation requirements for the degree of Doctor of Philosophy.

Date    Doctor Mark Dean, Advisor

Recommended to the Graduate Council

Date    Doctor Roberto Serrano, Reader
Date    Doctor Andriy Norets, Reader

Approved by the Graduate Council

Date    Peter Weber, Dean of the Graduate School

VITA

The author was born April 25th, 1985 in Montevideo, Uruguay. She entered the Universidad de Montevideo in 2003 and received her B.A. in 2006. At the same institution she began a Master's in Economics and taught several undergraduate classes. Meanwhile, from 2006 to 2010, she worked for the Uruguayan Bureau of Statistics as the project leader for the new national consumer price index (base December 2010 = 100). In August of 2010 she entered Brown University to pursue her degree in Economics. She received her M.A. in 2011 and Ph.D. in 2016.

Acknowledgements

I would like to express my special appreciation and thanks to my advisor, Professor Dr. Mark Dean: you have been a tremendous mentor for me. I would like to thank you for encouraging my research and for allowing me to grow as an economist. Your advice on both research and my career has been priceless. I would also like to thank my committee members, Professor Dr. Roberto Serrano and Professor Dr. Andriy Norets, for serving on my committee, with special thanks for all the support during the job market experience and for your brilliant comments and suggestions. I would like to thank the economics department office staff, and especially Angelica Vargas, whose guidance through the entire graduate school experience and immeasurable support helped me survive the program and, in particular, the job market experience. I would also like to show my gratitude to Emily Oster for her guidance on the job search process, and to Pedro dal Bo, Geoffroy deClippel, Jack Fanning, Frank Kleibergen, Anja Sautmann, Rajiv Vohra and seminar participants at Brown University, Econcon2015, the 3rd Annual Warwick Economics PhD Conference and the Washington University 9th Annual Graduate Student Conference for helpful comments and suggestions that contributed to this thesis.

I want to thank all of my teachers at Instituto Pallotti for their guidance. I would also like to thank my professors at Universidad de Montevideo, especially Juan Dubra, Marcelo Caffera, Lorenzo Caliendo and Fernando Borraz, who pushed me to be more than I could ever imagine I was capable of; without their support and encouragement, I would never have thought of, or tried, pursuing my PhD.

Graduate school is, of course, a long and arduous process. Without the right friends and family, who can both support your work and balance it with play, it would be impossible to make it through. I owe special thanks to my friends and family. Words cannot express how grateful I am to my father and my youngest brother Martín for all of the sacrifices that they have made on my behalf, and for their encouragement in this endeavour and constant support throughout the process, even when the situation back home was far from ideal. A great factor in my success, and in making me who I am, has been my friends, the family of my own choice.
To my high school and college friends I owe my sanity; I would not have been able to make it this far without them. For that I have to thank Valentina Canto, Cynthya Broquetas, Adriana Montero, Florencia Mazzoni and the CATUM group. Thanks also to all my colleagues at INE, who helped me grow as a professional and supported my application to graduate school. For keeping me sane and showing me that there is a life outside grad school, I would like to thank everyone I met through the practice of handball and field hockey, as well as the staff at the GCB. Finally, I would also like to thank all of my "Providence friends", who supported me in writing and encouraged me to strive towards my goal. For this I have to thank, among others, James Budarz, Nicholas Coleman, Sean Dinces, Morgan Hardy, Angélica Meinhofer, Nickolai Riabov, Daniela Scidá and Tim Squires.

Contents

1 Predictive Ability and the Fit-Power Trade-Off in Theories of Consumer Behavior
  1.1 Introduction
  1.2 The Rationality Model
    1.2.1 Axiomatization
    1.2.2 Challenges for Testing
  1.3 Prediction Based on Axiomatic Models
    1.3.1 Set up
    1.3.2 Predictions Based on GARP
    1.3.3 Stochastic Extension and Predictive Distribution
    1.3.4 On the Estimation of the Error Process
  1.4 Predictive Performance Analysis
    1.4.1 Decision Making and the Predictive Distribution
    1.4.2 Marginal Likelihood and Model Averaging
    1.4.3 Measures of Predictive Ability
    1.4.4 Comparison to other measures in the literature
    1.4.5 Optimal Design of Experiments
  1.5 Empirical Application
    1.5.1 Effect of Additional Information
    1.5.2 Empirical Performance of Predictive Ability Measures and Comparison to Other Measures in the Literature
    1.5.3 Model Comparison: GARP vs. Parametric Specification
  1.6 Relation to the Literature
  1.7 Conclusion
  1.A Proofs
    1.A.1 Proofs of Section 1.3
    1.A.2 Proof of Section 1.4.3
    1.A.3 Proofs of Section 1.4.4
  1.B Error Process
    1.B.1 Algorithms - Estimation of the Error Process in Finite Samples
    1.B.2 Performance of Algorithms 1.26, 1.27 and 1.28 for Simulated Data
  1.C Tables
    1.C.1 Data
    1.C.2 Comparison Between Subjects 313 and 516
2 Satisficing and Stochastic Choice - with Victor Aguiar and Mark Dean
  2.1 Introduction
  2.2 Set Up
    2.2.1 Data
    2.2.2 The Satisficing Model
  2.3 Characterizing the Satisficing Model
    2.3.1 A Negative Result
    2.3.2 Full Support Satisficing Models
    2.3.3 Fixed Distribution Satisficing Models
    2.3.4 Comparative Statics
  2.4 Extensions
    2.4.1 Fixed Distributions Without Full Support
    2.4.2 Allowing for Indifference
    2.4.3 Incomplete Datasets
    2.4.4 Random Utility and Random Threshold
  2.5 Relation to Existing Literature
  2.A Proofs
    2.A.1 Proof of Proposition 2.4
    2.A.2 Proof of Lemma 2.7
    2.A.3 Proof of Theorem 2.11
    2.A.4 Proof of Theorem 2.14
    2.A.5 Proof of Theorem 2.17
    2.A.6 Proof of Theorem 2.18
    2.A.7 Proof of Claim 2.19
    2.A.8 Proof of Proposition 2.23
    2.A.9 Comparative Statics with Respect to Search Orders
    2.A.10 Proof of Proposition 2.28
    2.A.11 Proof of Theorem 2.32
    2.A.12 Proof of Theorem 2.33
    2.A.13 Proof of Theorem 2.34
    2.A.14 Proof of Lemma 2.39
    2.A.15 Proof of Lemma 2.40
    2.A.16 Proof of Lemma 2.36
    2.A.17 Proof of Theorem 2.37
    2.A.18 Proof of Lemma 2.41
    2.A.19 Proof of Theorem 2.42
    2.A.20 Proof of Theorem 2.48
Bibliography

List of Tables

1.1 Data for Example 2.3
1.2 Data for Example 1.6
1.3 Data for Example 1.7
1.4 Simulations for the Algorithms
1.5 Experimental design
1.6 Summary statistics for (Choi, Fisman, Gale, and Kariv 2007) - half and full sample
1.7 Statistics comparison
1.8 Model comparison
1.9 Simulations algorithms
1.10 Summary statistics
1.11 Summary statistics - half of observations
1.12 Comparison - half of observations
1.13 Correlations
2.1 An example of the satisficing model
2.2 Changes in the probability of search orders and its effect on DM’s well-being

List of Figures

1 Relation between empirical consistency, falsifiability and predictive ability
1.1 The goodness of fit problem
1.2 The power of the rationality test
1.3 Relation between empirical consistency, falsifiability and predictive ability
1.4 Observed Choices and the "supporting Set"
1.5 Observed economic environments and the "supporting set"
1.6 "Supporting set" and number of observations
1.7 Predictive Distribution - Intuition
1.8 Relation between empirical consistency, falsifiability and predictive ability
1.9 Predictive densities for Example 2.3
1.10 Histogram of Projected-SSR / Generated-SSR
1.11 Relation between empirical consistency, falsifiability and predictive ability
1.12 Comparison BC and predictive ability measures
1.13 Comparison BC and predictive ability measures
1.14 Optimal experimental design for prediction
1.15 Comparison between subjects
1.16 Predictive densities - comparison
Introduction

An economic model can be understood as a theoretical construct that, by simplifying a complex reality, seeks to aid in the understanding and/or prediction of economic processes. As such, for the most part, these models are not meant to replicate an intricate reality; they are intended to help in understanding the processes that drive observed outcomes and, if desired, to be used in predicting responses to unobserved economic environments.1

1 For more on the role of models in economics refer to (Gibbard and Varian 1978), (Sugden 2000), (Sugden 2009), (Mäki 2005), (Caplin and Schotter 2010), among others.

Though economic models can be thought of as interesting theoretical exercises to better understand specific economic processes,2 special interest lies in checking whether these models can be supported by empirical evidence and/or be useful to predict behavior in out-of-sample economic environments. For that, it is important to understand the connection between the theory and its empirical counterpart. Axiomatizations describe the implications of the model in terms of observable behavior, delivering necessary and sufficient conditions that are equivalent to the theoretical model.3

2 Notable examples are "The Market for Lemons: Quality Uncertainty and the Market Mechanism" (Akerlof 1970) and "Job Market Signaling" (Spence 1973).
3 (Dekel and Lipman 2010) argue for the value provided by axiomatic derivations.

Consider, for example, the utility maximization model, which is the focus of Chapter 1. Utility functions are unobservable, but their behavioral manifestations, choices, are not. The Generalized Axiom of Revealed Preference (GARP henceforth) provides the necessary and sufficient conditions on observed behavior for it to be consistent with a decision maker that behaves as if her choices maximize her utility function among feasible alternatives.

Understanding the behavioral implications of a model is relevant not only for testing its empirical validity but also for selecting among theories for modeling behavior. Two seemingly distinct behavioral models cannot be teased apart if they result in the same observed data; that is, if for a sequence of decision problems the behavioral implications of two competing models are not significantly distinct, then the theoretical differences between the models are irrelevant.4 The empirical content of a model depends on the behavior that it intends to explain and on the available data; i.e., comparisons across competing models can only be made conditional on the behavior that they intend to explain for the considered sequences of economic environments. For example, the satisficing model5 is a highly influential and intuitive model of bounded rationality, but it cannot be tested using standard choice data; its behavioral implications are indistinguishable from those of a standard utility maximizer. In Chapter 2, my coauthors and I propose an alternative way to test for the satisficing model based on stochastic choice data. Assuming that preferences are fixed, but search order may change randomly, the model predicts that stochastic choice can only occur amongst elements that are always chosen, while all other choices must be consistent with standard utility maximization. Adding the assumption that the probability distribution over search orders is the same for all choice sets makes the satisficing model a subset of the class of random utility models.

4 (Gul, Pesendorfer, et al. 2008)
5 (Simon 1955)

The empirical content of a model and its suitability to explain a given data set depend on the sequence of decision problems being studied. Consider again the utility maximization model. If the decision maker faces menus of alternatives A = {a, b} and B = {c, d}, the utility maximization model has no empirical content, i.e. any profile of choices from those menus can be rationalized. Consider now a decision maker choosing from A′ = {a, b, c} and B′ = {b, c, d}; for those sets of feasible alternatives the theory has empirical content.

The empirical content of a model also depends on the behavior that it seeks to explain or predict. For example, assume that the decision maker chooses C(A′) = {a}; then any choice from B′ would be consistent with the model, i.e. choices from A′ are not informative about the predictions of the model for B′. This latter example raises a few questions when considering limited data sets: what information can be learned about preferences? What predictions can be made based on observed behavior? These questions are addressed, for the case of the utility maximization model, in Chapter 1. As discussed in (De Clippel and Rozen 2014), the latter is not a trivial question for many bounded rationality models. The authors show that, for some models, a limited data set observed on a subset of possible menus may appear consistent with the model even though no extension to a complete data set would be.

(Gabaix and Laibson 2008) describe seven key properties of useful economic models: parsimony, tractability, conceptual insightfulness, generalizability, falsifiability, empirical consistency, and predictive precision. The last three properties pertain to the relation between the theoretical model and its empirical counterpart. A model is falsifiable if it makes nontrivial predictions about behavior, i.e. if there is some feasible behavior profile that is inconsistent with the model. As a simplification of reality, a model is not expected to perfectly explain human behavior in all possible circumstances.
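The menu examples given earlier in this Introduction, and the notion of falsifiability as the existence of some feasible behavior profile that is inconsistent with the model, can be checked by brute force. The following is a minimal sketch and not part of the dissertation's method: it assumes finite menus, single-valued choices and strict preferences (no indifference), and the function names `rationalizable` and `consistent_profiles` are hypothetical.

```python
from itertools import permutations, product


def rationalizable(menus, choices):
    """Return True if some strict ranking of all alternatives makes
    every observed choice the best element of its menu."""
    universe = sorted(set().union(*menus))
    for ranking in permutations(universe):
        # Lower index means more preferred.
        rank = {x: i for i, x in enumerate(ranking)}
        if all(min(menu, key=rank.get) == c for menu, c in zip(menus, choices)):
            return True
    return False


def consistent_profiles(menus):
    """List every profile of single-valued choices (one per menu)
    that is consistent with maximizing some strict ranking."""
    all_profiles = product(*[sorted(m) for m in menus])
    return [p for p in all_profiles if rationalizable(menus, p)]


# Menus from the text: A = {a, b}, B = {c, d} versus A' = {a, b, c}, B' = {b, c, d}.
disjoint = [{"a", "b"}, {"c", "d"}]
overlapping = [{"a", "b", "c"}, {"b", "c", "d"}]

n_disjoint = len(consistent_profiles(disjoint))
n_overlapping = len(consistent_profiles(overlapping))

print(n_disjoint, "of", 2 * 2, "profiles are rationalizable for A, B")
print(n_overlapping, "of", 3 * 3, "profiles are rationalizable for A', B'")
```

Under these assumptions, every profile is rationalizable for the disjoint menus, so the model has no empirical content there, while for the overlapping menus the two profiles in which b and c are each chosen over the other are ruled out. The sketch also reproduces the observation that choosing a from A′ leaves every choice from B′ admissible.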
However, as researchers, we are interested in the empirical accuracy of the model and, when observed behavior is not consistent with the model, in the degree to which the model fits the data. This is known as the goodness of fit criterion.

The goodness of fit of a model is not the only relevant criterion when judging the suitability of a model. The limitations of this criterion are especially relevant when dealing with axiomatic models with limited data sets, as discussed in Chapter 1. In particular, a model may be empirically accurate because the data were indeed generated by a decision maker that behaves as the theory prescribes; but it could also be empirically accurate because, for the considered sequence of decision problems, the model delivers weak (loose) predictions, such that almost any behavior would be consistent with the theory. This caveat relates to predictive precision. A model that delivers more precise predictions is desirable because it facilitates its evaluation and testing; furthermore, "A model with predictive precision may even be useful when it is empirically inaccurate (...) In general, models that make approximately accurate strong predictions are much more useful than models that make exactly accurate weak predictions."6 Figure 1 illustrates the importance of predictive precision and how empirically inaccurate models can be useful if they deliver strong predictions.

Chapter 1 exploits the connection between predictive precision and the power to identify inconsistent behavior conditional on observed choices, proposing a predictive ability approach for model assessment. This approach provides a meaningful answer to the tension between the severity of the violations (fit) and the sensitivity of the test to detect them (power), a long-standing problem in the literature when dealing with axiomatic representations for behavioral models.
The discussion in the chapter is centered around the utility maximization model, though most results extend to a general class of behavioral models. Utility maximization is a core assumption in economics for which the Generalized Axiom of Revealed Preference (GARP) provides a nonparametric test. However, two problems arise when testing GARP. First, it provides an extremely sharp test, since rejection occurs after only one violation. Second, even when a data set passes the test, this cannot necessarily be understood as a success for the theory, since the conditions may be so undemanding that any behavior would pass. The predictive ability approach developed in the chapter naturally establishes a meaningful trade-off between empirical accuracy and falsifiability for models of consumer behavior, a long-standing problem in the literature. Intuitively, better predictions are the result of a large amount of revealed preference information (a better degree of identification of underlying preferences) while requiring small errors for data to be consistent with the model.

6 (Gabaix and Laibson 2008)

Figure 1: Relation between empirical consistency, falsifiability and predictive ability. Empirical consistency (fit) is important but does not imply usefulness of the model. Even when falsifiable, a model may not restrict behavior enough to allow for learning about underlying behavior.

This dissertation emphasizes the importance of the empirical implications of models of behavior. Chapter 1 discusses the limitations of the goodness of fit approach for axiomatic models when dealing with incomplete data sets. The lack of power to identify inconsistent behavior with limited data sets can be captured by the uncertainty in predicting behavior based on the model. By accounting for this uncertainty and allowing for empirically inaccurate behavior, the predictive ability approach provides a meaningful criterion to study the suitability of a theory to model observed behavior. Chapter 2 provides a novel set of conditions to identify whether a decision maker behaves as a satisficer, in contrast to a standard utility maximizer. This is done by exploiting the structure of stochastic choice data without requiring data on the decision process itself.

Outline

The outline of this dissertation is as follows. Chapter 1 presents a predictive approach for assessing the suitability of models that are described by a set of axioms.7 Chapter 2 presents necessary and sufficient conditions for stochastic choice data to be consistent with the satisficing model.8 Finally, the Conclusion closes with a summary of the main results and open lines of future research.

7 The content of this chapter has been presented before in the paper "Predictive Ability and the Fit-Power Trade-Off in Theories of Consumer Behavior".
8 The content of this chapter has been circulated under the title "Satisficing and Stochastic Choice", coauthored with Victor H. Aguiar and Mark Dean.

Chapter 1

Predictive Ability and the Fit-Power Trade-Off in Theories of Consumer Behavior

1.1 Introduction

The existence of a utility function that represents preferences is a core assumption in economics. The Generalized Axiom of Revealed Preference (GARP henceforth) provides an elegant axiomatic test for the validity of this assumption. For GARP, rejection occurs after only one violation, providing a sharp test that most data sets fail. This is a common feature of all models that are described using a set of behavioral axioms. Axioms provide elegant nonparametric tests, but only provide binary information as to whether a particular data set is consistent with the model. In the case of inconsistent data, axiomatizations say nothing about the significance of the departures from the model. This is known as the goodness of fit problem in the literature.

A series of goodness of fit measures have been proposed to address the sharpness of the rationality test. Most of these measures are based on an intuitive moment of the data related to the adjustments to income needed to remove the violations by making them no longer feasible.1 The significance of these measures is difficult to gauge, since the experimental design may be such that no constraints, or only very loose ones, are imposed on the data. For example, if budget sets are nested, any data set would exhibit perfect consistency with GARP since no constraints are imposed by the model. This is known in the literature as the power problem, since it relates to the probability of identifying violations when the data are generated by a non-rational process.2

This chapter assesses the performance of the model by asking whether the model is useful for making precise predictions about behavior. A good model is one that provides precise predictions, reducing the uncertainty of forecasted behavior for unseen economic environments. More precise predictions are the result of a large amount of revealed preference information that can be learned from data, while requiring a small error for observed behavior to be consistent with the model. Given behavior, more stringent environments, which impose more demanding constraints on data to satisfy the model, result in more precise revealed preference information learned from data. However, more demanding environments increase the likelihood of detecting violations, leading to overall bigger errors. Hence, the predictive ability approach provides a meaningful way to integrate fit and the degree to which the data constrain the model.

In order to construct the predictive distribution of choices in a new environment, I extend the model to embed it in a statistical framework allowing for an additive error component.
Within this framework, I recover the "candidate model", that is, the most likely sequence of rational choices that generated the data, and an estimate for the error process. Next, I combine these two to compute the predictive distribution generated by the extended model. For an out-of-sample economic environment, I first construct the set of choices that are consistent with the "candidate model". This is the "supporting set" first introduced by (Varian 1982). For an alternative in the "supporting set", the model predicts that choices are given by that alternative plus the error. Then, following the intuition of Bayesian model averaging, I construct the predictive distribution as the distribution of choices on the "supporting set" plus the distribution of the error process. Without prior information, all alternatives in the supporting set are assumed to be equally likely. This distribution can be interpreted as the prediction that can be made by only assuming the nonparametric model, given observed behavior. This construction extends to a general class of behavioral models, provided that an algorithm to recover the "candidate model" is available. For GARP, it is available in the form of a minimum distance estimator provided by (Kocoska 2012).

1 Notable examples are (Afriat 1967), (Varian 1990) and (Echenique, Lee, and Shum 2011).
2 See (Andreoni and Harbaugh 2013) for an extensive analysis of the power problem when testing for GARP.

The predictive distribution reflects two sources of uncertainty when forecasting behavior: "model uncertainty", which derives from the fact that there may be different preference relations that are consistent with observed data;3 and "error uncertainty", which derives from the fact that the decision maker may not perfectly maximize her utility. A model is better for prediction if it reduces either or both sources of uncertainty when forecasting behavior.
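To fix ideas, the two sources of uncertainty can be visualized in a stylized one-dimensional case. The sketch below is an illustration under assumptions that are not taken from the chapter: the choice is summarized by a single number (for instance, the share of the budget spent on one good), the supporting set is an interval [lo, hi], the error is Gaussian with standard deviation sigma, and the predictive density is not truncated to the feasible set. Under those assumptions the predictive density is the convolution of a uniform and a normal density, which has a closed form. All numbers and names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def predictive_density(x, lo, hi, sigma):
    """Density at x of U + e, where U ~ Uniform[lo, hi] (the supporting set)
    and e ~ Normal(0, sigma^2) (the error), drawn independently."""
    return (normal_cdf((x - lo) / sigma) - normal_cdf((x - hi) / sigma)) / (hi - lo)


def integrate(f, a, b, n=20000):
    """Trapezoid rule, used only as a sanity check that the density integrates to one."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))


# Hypothetical supporting sets: a wide one (little revealed preference
# information) and a narrow one (a lot of it), with the same error scale.
def wide(x):
    return predictive_density(x, 0.20, 0.80, 0.05)


def narrow(x):
    return predictive_density(x, 0.45, 0.55, 0.05)


total_wide = integrate(wide, -1.0, 2.0)
total_narrow = integrate(narrow, -1.0, 2.0)
peak_wide = wide(0.5)
peak_narrow = narrow(0.5)

print(round(total_wide, 4), round(total_narrow, 4))
print(round(peak_wide, 3), round(peak_narrow, 3))
```

In this stylized setting, shrinking the supporting set (less model uncertainty) or lowering sigma (less error uncertainty) both concentrate the predictive density, which is the sense in which either reduction makes the model better for prediction.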
Hence, the predictive distribution implicitly defines a trade-off between the amount of information that can be learned about preferences and the severity of the violations, naturally extending the prediction that would result from the axiomatization to allow for error. As I discuss in Section 1.4, there is a strong link between the amount of recoverable revealed preference information and the power of a data set to highlight violations.

3 To illustrate this point consider the following simple example. Let the universe of choices be {a, b, c}, and let observed data be such that a is chosen when {a, b} was available, and a is chosen from {a, c}. Then we may have a ≻ b ≻ c or a ≻ c ≻ b with equal probability.

For a researcher considering a model for predicting behavior for a particular application, the predictive distribution is a sufficient statistic to compute optimal choices and expected losses due to incorrect predictions. For a set of candidate models, the predictive distribution can be used to calculate an analog of the marginal likelihood of each model, that is, how likely observed data are given the model. Given a prior probability for each candidate model, marginal likelihoods can be used to construct Bayes-like factors for model selection. Marginal likelihoods can also be used to compute weights for Bayesian model averaging (BMA) to account for uncertainty with respect to model selection. The BMA predictive distribution is constructed as the weighted average of the predictive distributions provided by each candidate model, where the weights are given by the probability of each model given observed data.

For a researcher that does not have a particular application of interest but seeks to appraise the predictive performance of the model, I provide two intuitive statistics that summarize the information content of the predictive distribution. These statistics define a complete order of data sets and/or models in the domain of predictive distributions.
These summary statistics are: (i) the size of the shortest α− level credible set for predicted choices and (ii) the Kullback-Leibler divergence measure between the predictive distribution and an uninformative prior. These measures reflect the extent of both sources of uncertainty, model and error, and can be used for the design of exper- iments. The selection of the economic environment in which to predict behavior may affect the results, therefore, I propose a leave-one-out construction for these measures to remove any arbitrariness in this regard. The leave-one-out (LOO) predictive preci- sion measures are computed based on the predictive distribution constructed for each observed economic environment considering all remaining observations. 11 I study the empirical performance of the predictive measures in their application to the experimental data from (Choi, Fisman, Gale, and Kariv 2007). As expected, mea- sures of predictive precision are positively correlated with measures of goodness of fit and with the amount of information that can be extracted from data. I find that more observations lead to more precise predictions since the additional revealed preference information outweighs the potential increase in estimated errors. Finally, I compare the performance of the Cobb-Douglas model, nested into the rationality model, and GARP, which does not make any assumption about the shape of the utility function. When comparing the two models, the results show that the nonparametric utility maximiza- tion model provides more precise predictions than a model that assumes Cobb-Douglas preferences, since the additional precision of the latter model is outweighed by the mis- specification error. This chapter ties together the goodness of fit and power literatures. The information provided by standard goodness of fit measures is captured by the predictive precision measures through the recovered error process, as shown in Proposition 1.21. 
Additionally, the LOO predictive precision measures reflect power, through the size of the supporting set. Given the projection using $J - 1$ observations, the supporting set is the subset of feasible choices, in the $J$th economic environment, that are jointly consistent with the projection. Conditional on the projection, its relative size is the probability of choices being consistent with the model, if they were to be generated uniformly at random from the set of feasible choices. Then, the relative size of the supporting set is the complement of the probability of detecting violations in the $J$th economic environment under Becker's (1962) definition of irrational behavior; that is, power (as traditionally understood in the literature; see Bronars 1987) conditional on the projection.4 Hence, this approach provides a meaningful trade-off between fit and power, a long standing problem in the literature.

4 To illustrate this point, consider a decision maker choosing from $\{a, b, c\}$ and $\{a, c\}$. If $a$ (or $c$) is chosen from $\{a, b, c\}$, then $\{a, c\}$ imposes demanding constraints since it is possible to violate GARP. On the other hand, if $b$ is chosen from $\{a, b, c\}$, then any choice from $\{a, c\}$ would be consistent with GARP.

Current approaches that combine fit and power, such as Beatty and Crawford (2011), measure power in line with the approach of Bronars (1987). However, this approach fails to account for the actual pattern of choices observed.5 Observed choices determine whether the new budget set constrains data or not. This is important since, for the empirical application (50 observations), standard power is approximately one for all subjects, which makes it impossible to rank different designs by how stringent they are. I show that two subjects who faced identical economic environments and whose behavior is consistent with GARP can produce significantly different predictions for the same economic environment due to differences in the supporting set.
I show that unconditional power masks significant differences in the amount of preference information that can be inferred for different subjects to predict behavior.

5 Andreoni and Harbaugh (2013) present a series of power measures and indices whose behavior depends on the characteristics of choices in the population and on the experimental design.

Outline. The outline of this chapter is as follows. Section 1.2 presents the main results of revealed preference theory, its testable implications and the challenges that it presents for testing. Section 1.3 introduces the construction of the predictive distribution and discusses its properties. Section 1.4 shows that the predictive distribution is a sufficient statistic for a decision maker who wants to select a model to predict behavior. For a researcher who does not have an application in mind, I show how to use the predictive distribution to construct the marginal likelihood of the model and measures that summarize the predictive performance of the model. Section 1.5 shows the empirical performance of the proposed measures in their application to the experimental data from Choi, Fisman, Gale, and Kariv (2007). Section 1.6 offers a review of the literature concerning rationality testing and how the approach followed in this chapter compares to existing approaches. Finally, Section 1.7 concludes.

1.2 The Rationality Model

1.2.1 Axiomatization

When facing a set of feasible alternatives, the choices made by a decision maker reveal information about her own preference relation: chosen alternatives are revealed to be better than non-chosen ones. This is the concept of (directly) revealed preference, first introduced by Samuelson (1938). Houthakker (1950) extended it by imposing transitivity on the directly revealed preferred relation.

Let $\{q^j\}_{j=1}^J$ be the sequence of observed choices, with $q^j \in \mathbb{R}_+^L$. The set of feasible alternatives is determined by the decision maker's budget constraints, that is, for all $j \in \{1, \dots, J\}$, $p^j \cdot q^j \le x^j$, where $p^j \in \mathbb{R}_{++}^L$ is the price vector faced by the decision maker in environment $j$, and $x^j$ her income. Then, chosen alternatives $q^j$ are said to be revealed preferred to the other feasible alternatives that were not chosen. Formally,

Definition 1.1 (Directly Revealed Preferred) $q^j$ is directly revealed preferred to $\tilde{q}$ if $p^j \cdot q^j \ge p^j \cdot \tilde{q}$, and it is strictly revealed preferred if $p^j \cdot q^j > p^j \cdot \tilde{q}$.

Definition 1.2 (Revealed Preferred) $q^j$ is revealed preferred to $\tilde{q}$ if there is a chain of directly revealed preferred bundles linking $q^j$ to $\tilde{q}$.

Afriat's (1967) theorem shows that revealed preference theory provides a nonparametric condition on the consumer's choices that is necessary and sufficient for observed behavior to be consistent with non-satiated utility maximization: the generalized axiom of revealed preference (GARP). This axiom imposes internal consistency on the information contained in the revealed preference relation for preferences to be represented by a well behaved utility function. Formally,

Definition 1.3 (Generalized Axiom of Revealed Preference (GARP)) If $q^j$ is revealed preferred to $\tilde{q}$, then $\tilde{q}$ is not strictly revealed preferred to $q^j$.

Figure 1.1: The goodness of fit problem. (a) Data consistent with GARP. (b) Data inconsistent with GARP. The sharpness of the test is evidenced in the figure. Panel 1.1a shows data that are perfectly consistent with GARP, while in panel 1.1b one observation, x, was perturbed. The data in panel 1.1b are no longer consistent with utility maximization.

1.2.2 Challenges for Testing

Two main problems arise when testing for the revealed preference conditions: fit and power. First, GARP delivers a sharp test, since rejection occurs after one violation, regardless of the number of consistent observations and/or severity of the violations. Figure 1.1 depicts the case of a decision maker for which we observed 14 choices.
Panel 1.1a shows the case of perfectly consistent choices, while panel 1.1b shows the same choices but with choice x slightly perturbed. The data in panel 1.1b are no longer consistent with GARP. Most empirical studies find violations of the deterministic axiomatization.6 Naturally one may ask how severe these violations are. This is known in the literature as the goodness of fit problem. Measures of fit have been proposed in the literature that are, for the most part, based on the extent of the income adjustment to budget sets required to remove inconsistencies by making them no longer feasible. These provide intuitive measures of the size of the departures, but a high goodness of fit measure can hardly be understood as a success for the model. Good fit could indicate behavior closely consistent with rationality or failure to detect violations.

6 From the empirical point of view, several studies have shown the existence of GARP violations in different settings: cross section data (Blundell, Browning, and Crawford 2003; Beatty and Crawford 2011; Echenique, Lee, and Shum 2011, among others); laboratory experiments (Andreoni and Miller 2002; Sippel 1997; Choi, Fisman, Gale, and Kariv 2007, among others); as well as field experiments (for example, Harbaugh, Krause, and Berry 2001).

The stated conditions may deliver very loose or non-restrictive constraints on data. For example, if the decision maker faces nested budget sets, any behavior is consistent with GARP. Then one may ask how sensitive the test is to departures from rationality. Conditional on observed behavior, the more the data constrain choices, the more precise the identification of the preference relation generating choices. The link between the amount of revealed preference information that can be learned from data and the sensitivity of the test to detect violations of GARP is studied in Section 1.4.4.
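The sharp test of Definition 1.3 and an income-adjustment measure of fit can be sketched in a few lines. The block below is a minimal illustration under stated assumptions, not the procedure used in this chapter: it assumes that expenditure equals income, checks GARP through Warshall's transitive closure, and computes Afriat's critical cost efficiency index by bisection as one example of the income-adjustment measures mentioned above. The function names, the numerical tolerance and the two-observation data sets are hypothetical.

```python
def garp_holds(prices, bundles, efficiency=1.0, tol=1e-9):
    """True if the data satisfy GARP when every budget is scaled by `efficiency`."""
    n = len(bundles)
    # cost[i][j]: cost of bundle j at the prices of observation i
    cost = [[sum(p * q for p, q in zip(prices[i], bundles[j])) for j in range(n)]
            for i in range(n)]
    # Directly revealed preferred (Definition 1.1), with scaled budgets
    rel = [[efficiency * cost[i][i] >= cost[i][j] - tol for j in range(n)]
           for i in range(n)]
    # Revealed preferred (Definition 1.2): transitive closure, Warshall's algorithm
    for k in range(n):
        for i in range(n):
            if rel[i][k]:
                for j in range(n):
                    if rel[k][j]:
                        rel[i][j] = True
    # GARP (Definition 1.3): i revealed preferred to j rules out j strictly over i
    for i in range(n):
        for j in range(n):
            if rel[i][j] and efficiency * cost[j][j] > cost[j][i] + tol:
                return False
    return True


def critical_cost_efficiency(prices, bundles, iterations=50):
    """Largest budget scaling in [0, 1] under which GARP holds (bisection)."""
    if garp_holds(prices, bundles):
        return 1.0
    low, high = 0.0, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2
        if garp_holds(prices, bundles, efficiency=mid):
            low = mid
        else:
            high = mid
    return low


# Hypothetical data: two budgets with expenditure 4 each
prices = [(2.0, 1.0), (1.0, 2.0)]
consistent = [(1.0, 2.0), (2.0, 1.0)]  # neither bundle affordable at the other budget
violating = [(2.0, 0.0), (0.0, 2.0)]   # each bundle costs 2 at the other budget

print(garp_holds(prices, consistent), garp_holds(prices, violating))
print(critical_cost_efficiency(prices, violating))
```

In the violating data set each chosen bundle could have been bought for half of the expenditure at the other budget, so budgets must be scaled down to about one half before the inconsistency disappears. The index says nothing about how demanding the budgets were, which is the power problem discussed next.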
The literature has addressed this problem by providing power indices. These indices measure the probability of observing GARP violations if choices were to be generated uniformly at random from the set of feasible alternatives; see Bronars (1987) and Beatty and Crawford (2011). However, the sensitivity of the test to departures from the model does not only depend on the number and tightness of the constraints imposed by the model, but also on the interaction of these with observed choices. To illustrate this point, consider the problem shown in Figure 1.2. The budget sets faced by the decision maker can be informative, as in Panels 1.2b and 1.2c, or not informative at all, as in Panel 1.2a. Moreover, given observed budget sets, the probability of detecting violations of GARP depends on observed behavior, and so, therefore, does the information that can be inferred from choices. This is shown in Panels 1.2b and 1.2c.

Figure 1.2: The power of the rationality test. (a) Uninformative design, regardless of choices. (b) Informative budget sets, informative choices. (c) Potentially informative design, uninformative choices. This figure presents hypothetical budget sets faced by a decision maker. Panel 1.2a shows two nested budget sets. Independent of observed behavior, it is impossible to detect violations of GARP or learn about underlying behavior. Panels 1.2b and 1.2c both depict the same budget sets. This design is potentially informative. The budget sets in panel 1.2b are informative given the distribution of choices, while in panel 1.2c they do not allow significant information to be inferred about the process generating the choices.

1.3 Prediction Based on Axiomatic Models

Economic models are theoretical constructions that simplify and represent economic processes to help in the description, understanding and/or prediction of human behavior. As simplifications, models are not meant to replicate a complex reality.
Even if the model is not empirically accurate, it may still be useful in predicting human behavior. Consider the following example.

Example 1.4 (GARP) A researcher wants to predict behavior for budget set $B^0$, defined by prices $p^0 = (1, 1)$ and income $x^0 = 3.5$. She models consumers' behavior by assuming that they are utility maximizers. For that she observes two decision makers, A and B, choosing bundles from two budget sets. The information is described in Table 1.1.

Table 1.1: Data for Example 1.4. Data for consumers A and B, choosing from $B(p^1, x^1)$ and $B(p^2, x^2)$.

Obs   px   py   x    qA          qB
1     5    1    5    (1/5, 4)    (5/8, 15/8)
2     1    2    5    (4, 1/2)    (1/2, 9/4)

This example is depicted in Figure 1.3. For budget set $B^0$, choices $q_1^A$ and $q_2^A$ are not informative even if they are consistent with GARP, since any bundle from $B^0$ would be consistent with them. On the other hand, even if inconsistent with GARP, choices for decision maker B can potentially provide more information for predicting behavior on $B^0$. For either $q_1^B$ or $q_2^B$ there is just a subset of feasible choices on $B^0$ that would be consistent with them and GARP. Allowing for error would deliver even more precise predictions, which result from the intersection of the choices that are jointly consistent with minimally perturbed $q_1^B$ and $q_2^B$.

This example shows that empirical accuracy may be important but it is not the only relevant attribute of a model. In this regard, Gabaix and Laibson (2008) recognize that "predictive precision is infrequently emphasized in economics research". They argue that "strong predictions are desirable because they facilitate model evaluation and model testing (...) Models with predictive precision are useful tools for decision makers who are trying to forecast future events or the consequence of new policies.".

Ideally, the best model is the one that shows the best fit given observed data and imposes stringent constraints on data, but these two properties tend to be mutually inconsistent.
In Section 1.4.4, I show that there is a link between the amount of information that can be learned from data about preferences and the extent to which the model imposes demanding constraints on data. Thus, by proposing a predictive approach to assess the suitability of the model, this chapter provides a meaningful compromise between fit and power. First, I extend the model to embed it in a statistical framework, allowing for an additive error to behavior. Within this framework, I recover the revealed preference information and error process from the data. Next, I construct the predictive distribution as the expected distribution of choices for an unobserved economic environment, given the information recovered from data.

Figure 1.3: Relation between empirical consistency, falsifiability and predictive ability. (a) Subject A. (b) Subject B. Empirical consistency (fit) is important but does not imply usefulness of the model. Even when falsifiable, the model may not restrict behavior enough to allow for learning about underlying behavior.

The predictive ability of a model is not an inherent property of the model; it depends on the behavior that it seeks to predict. Therefore the analysis is performed for the model-data pair (the MD-program), allowing for comparison across data sets for a given model or across models for a given data set.

The standard practice in economics is to compare models based on their accuracy in explaining observed behavior (fit). But when dealing with axiomatizations, this criterion favors general models that deliver loose predictions, since models that deliver more precise predictions may fail to explain observed behavior.

Information criteria methods extend goodness of fit measures to account for model complexity,7 penalizing empirical accuracy to avoid overfitting. These are mostly used in the context of parametric model selection, where increasing the number of regressors almost always improves fit.
In the context of axiomatic models, there is a similar trade-off between the generality of the model and its empirical accuracy: general models are most likely empirically accurate but provide wide predictions, while more specific models potentially induce bigger errors but provide more precise predictions.

7 See Akaike (1998).

In finite samples, axiomatic models induce an additional source of uncertainty since, conditional on a sequence of rational choices and the error process, the model delivers a set prediction even for perfectly consistent data. In what follows I refer to this uncertainty as "model uncertainty". In this chapter, I follow the intuition of Bayesian Model Averaging (BMA) to account for "model uncertainty", by averaging across the predictions delivered by the model, that is, across all preference relations consistent with the revealed preference information inferred from data.

1.3.1 Set up

The econometrician observes choices made by an individual that faces $J$ decision problems. Observed data, $D$, consist of choices $\{q^j\}_{j=1}^J$ and prices and income $\{p^j, x^j\}_{j=1}^J$, with $q^j \in \mathbb{R}_+^L$, $p^j \in \mathbb{R}_{++}^L$ and $x^j \in \mathbb{R}_{++}$ for all $j = 1, \dots, J$; thus $D \equiv \{q^j, p^j, x^j\}_{j=1}^J$. Assuming local non-satiation,8 $x^j = p^j \cdot q^j$.9 Let $B : \mathbb{R}_{++}^{L+1} \to \mathbb{R}_+^L$ be the feasible consumption correspondence, which is defined by imposing the affordability and non-negativity constraints, that is,

$$B(p^j, x^j) \equiv \{\tilde{q} \in \mathbb{R}_+^L : p^j \cdot \tilde{q} \le x^j \equiv p^j \cdot q^j\}.$$

Let $\mathbf{B}(p, x) = \prod_{j=1}^J B(p^j, x^j)$ be the set of feasible sequences of consumption bundles.

8 Definition (Local Non Satiation) For any $q \in X$ and every $\varepsilon > 0$ there exists a $y \in X$ with $\|q - y\| \le \varepsilon$ such that $y$ is preferred to $q$.

9 Data on income are not required in the analysis that follows; however, for the sake of uniformity of notation I keep income as a separate variable, since new economic environments are defined by prices and income.
The model is a correspondence $M : \mathbf{B}(p, x) \to \mathcal{M} \subset \mathbf{B}(p, x)$ that maps from the set of feasible choices to the subset of feasible choices that is consistent with the model. That is,

$$\mathcal{M} \equiv M(\mathbf{B}(p, x)) = \{m \in \mathbf{B}(p, x) : m \text{ satisfies the model}\}.$$

Before extending the model to allow for error, I discuss the properties of the prediction provided by the deterministic model. The predictive distribution for the extended model (Section 1.3.3) expands the deterministic prediction to incorporate error in choices, and therefore inherits its properties.

1.3.2 Predictions Based on GARP

Given a vector of choices consistent with the model, $m \in M(\mathbf{B})$, I define $S \subseteq B^0$ as the subset of $B^0$ such that $(m, s) \in M(\mathbf{B}, B^0)$ for all $s \in S$. I call this set the "supporting set". The "supporting set" is the prediction delivered by the axiomatic model and coincides precisely with the supporting set defined by Varian (1982) when the model is GARP. Formally,

Definition 1.5 (Supporting set) Let $m \equiv (m^1, \dots, m^J)$ be a vector of choices consistent with the model, i.e. $m \in M(\mathbf{B}(p, x))$. Given a new economic environment $B^0$, I define the "supporting set" given $m$, $S^0(m)$, as

$$S^0(m) \equiv S(p^0, x^0 \mid m) \equiv \{s \in B^0 \mid (m, s) \in M(\mathbf{B}(\{p, p^0\}, \{x, x^0\}))\} \quad (1.1)$$

Figure 1.4: Observed choices are a relevant determinant of the "supporting set". (a) Choices and supporting set. (b) Choices and supporting set.

The size of this "supporting set" reflects the tightness of the identification of the underlying behavioral process, since, by definition, each distinct alternative in the "supporting set" represents a behaviorally distinguishable and well-behaved preference relation that is consistent with observed data. Therefore, the size of the "supporting set" reflects "model uncertainty", which depends on observed behavior, the distribution of budget sets and the number of observations. To illustrate these dependencies consider the following examples.
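Definition 1.5 can be made concrete for two goods before turning to the examples. The following is a minimal sketch under stated assumptions, not the construction used in this chapter: it restricts attention to the budget line of $B^0$ (local non-satiation), approximates the "supporting set" on a grid, and tests each candidate $s$ by checking GARP on the augmented sequence $(m, s)$. The helper names, the grid size and the two rational choices are hypothetical.

```python
def garp_holds(prices, bundles, tol=1e-9):
    """True if the observations satisfy GARP, with expenditure taken as income."""
    n = len(bundles)
    cost = [[sum(p * q for p, q in zip(prices[i], bundles[j])) for j in range(n)]
            for i in range(n)]
    rel = [[cost[i][i] >= cost[i][j] - tol for j in range(n)] for i in range(n)]
    for k in range(n):  # transitive closure of the direct relation
        for i in range(n):
            if rel[i][k]:
                for j in range(n):
                    if rel[k][j]:
                        rel[i][j] = True
    return not any(rel[i][j] and cost[j][j] > cost[j][i] + tol
                   for i in range(n) for j in range(n))


def supporting_set_on_grid(prices, choices, p0, x0, grid=140):
    """Grid points s on the budget line of B0 such that (m, s) satisfies GARP."""
    support = []
    for i in range(grid + 1):
        s1 = (x0 / p0[0]) * i / grid
        s2 = max(0.0, (x0 - p0[0] * s1) / p0[1])
        if garp_holds(prices + [p0], choices + [(s1, s2)]):
            support.append((s1, s2))
    return support


# Hypothetical rational choices on two budgets with expenditure 5 each
prices = [(5.0, 1.0), (1.0, 2.0)]
choices = [(0.625, 1.875), (1.0, 2.0)]
support = supporting_set_on_grid(prices, choices, p0=(1.0, 1.0), x0=3.5)
lowest, highest = support[0][0], support[-1][0]
print(len(support), lowest, highest)
```

With these hypothetical data, both observed bundles are affordable at $B^0$, so any $s$ on its budget line is strictly revealed preferred to them. GARP then excludes every $s$ that either observed choice is revealed preferred to, which leaves the grid points with $3/8 < s_1 < 2$.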
Example 1.6 (GARP - "Supporting set" depends on observed choices) The researcher has information on two decision makers, (a) and (b), who have faced the same budget sets. She is considering GARP as a model to predict choices. Their choices are presented in Table 1.2.

Table 1.2: Data for Example 1.6

Case   pA       xA   qA               pB       xB   qB             p0       x0
(a)    (5, 1)   5    (17/40, 25/8)    (1, 2)   5    (1.5, 1.75)    (1, 1)   3.5
(b)    (5, 1)   5    (1/4, 15/4)      (1, 2)   5    (3.5, 0.75)    (1, 1)   3.5

These data are depicted in Figure 1.4. Given observed budget sets, choices for decision maker (a) provide more information for predicting behavior in $B^0$ than decision maker (b)'s choices, showing that the size of the supporting set depends on observed choices.

Figure 1.4 (repeated caption omitted).

Figure 1.5: Observed economic environments are an important determinant of the "supporting set". (a) Choices and supporting set. (b) Choices and supporting set.

Example 1.7 (GARP - "Supporting set" depends on observed budget sets) The researcher has information on two decision makers, (a) and (b), who have faced different budget sets but still have chosen the same consumption bundles. She is considering GARP as a model to predict choices. Table 1.3 summarizes the data for these two consumers.

Table 1.3: Data for Example 1.7

Case   pA           xA   qA       pB         xB   qB           p0       x0
(a)    (2.15, 53)   10   (1, 2)   (2.5, 1)   6    (3.5, 1.5)   (3, 5)   15
(b)    (1, 1)       3    (1, 2)   (1, 1)     5    (3.5, 1.5)   (3, 5)   15

These data are shown in Figure 1.5. For the same observed choices and the same objective $B^0$, the results are considerably different, showing that the "supporting set" also depends on the environment in which choices were observed.

For any $B^0$ the "supporting set" is non-empty if and only if it is constructed from a sequence of choices consistent with the model. The size of the "supporting set" (weakly) shrinks with the number of observations,10 since more constraints restrict the set of
The size of the "supporting set" (weakly) shrinks with the number of observations10 , since more constraints restrict the set of 10 Figure 1.6 shows simulations results that illustrate the convergence of the "supporting set" for a rational consumer and a randomly selected B 0 when J increases. 23 Figure 1.6: "Supporting set" and number of observations Relative size of the supporting set kS 0 (m)k as the number of observations increases. The relative size is defined as kB 0 k . Budget sets B p2 were generated at random such that p1 ∼ U[3, 10], ∼ U[.5, 1.5] and x ∼ U[90, 105] and p1 ¦ © j j choices were generated such that m ≡ ar g ma x q j ∈B j min q1 , 12 q2 . Fixing B 0 , the supporting j set is constructed based on {mi , p i , x i }ki=1 with k = 1, 5, 10, 25, 50, 100, 200, 500. behavior that is consistent with them. At the limit, when the considered model is GARP, the "supporting set" is a compact and convex set. Finally, for a sequence of budget sets that becomes dense in space of prices and income, the "supporting set" for GARP converges to a point. The latter point implies that, in the limit, if the sequence of budget sets is demanding "enough" –in the sense that they intersect sufficiently enough– the preference relation that generates the data is point identified. These results are formalized in Proposition 1.8. Proposition 1.8 (Properties of the "Supporting set") Let B 0 be the economic environ- ment for which the "supporting set" is constructed. 1. S 0 (m) 6= ; if and only if m ∈ M(B) 2. Let BJ , BJ+1 be a set of J and J + 1 budget sets respectively, with mJ ∈ M(BJ ) and  mJ+1 = mJ , mJ+1 ∈ M(BJ+1 ) then S 0 (mJ+1 ) ⊆ S 0 (mJ ) Moreover if M = {m ∈ B|m satisfies GARP} then 24 1. S 0 (m) is a compact, convex set S 2. For an increasing sequence of budget sets C n such that C n ⊂ C n+1 ⊂ . . . and n {C n } L×n is dense in R++ , then mn ∈ B(C n ) such that mn >> 0 S 0 (mn ) → m0 Proof. Follows from (Mas-Colell 1978) and (Blundell, Browning, and Crawford 2008). 
1.3.3 Stochastic Extension and Predictive Distribution

I extend the axiomatic model to allow for error, embedding it in a statistical framework. This stochastic extension is constructed under the assumption that observed choices are the result of the decision maker's maximization of her own utility function, $m \in \mathcal{M}$, plus an idiosyncratic error, truncated due to feasibility constraints,

$$q^j = m^j + \tilde{\varepsilon}^j \quad \text{with } (m^1, \dots, m^J) \in \mathcal{M}.$$

This error can be understood as measurement error or error made by consumers when making decisions. The proposed approach does not rely on the nature of the error process; however, the predictive distribution constructed below does. The reader interested in extending the construction proposed below to alternative error structures (for instance, error with respect to the decision maker's perception of her own utility function or with respect to prices) needs to estimate the "candidate model" and the error process consistent with such an alternative definition.

By definition, observed and rational choices are feasible, that is, $q \in \mathbf{B}(p, x)$ and $m \in \mathcal{M} \subseteq \mathbf{B}(p, x)$. Therefore the effective error process, $\tilde{\varepsilon}$, is restricted such that perturbed choices are feasible.11 Feasibility imposes two sets of constraints, nonnegativity and affordability. First, for each budget set, feasibility constraints restrict the distribution of the error process to lie on the budget hyperplane, that is, such that $p^j \cdot \varepsilon^j = 0$. Nonnegativity constraints are imposed as follows,

$$q^j = \arg\min_{\tilde{q}^j \in B^j} \|m^j + \varepsilon^j - \tilde{q}^j\| \quad (1.2)$$

for all $j = 1, \dots, J$, with $\varepsilon^j \sim F_{\varepsilon \mid p^j \cdot \varepsilon^j = 0}$ and $m \in M(\mathbf{B}(p, x))$, where $F_\varepsilon$ is a well behaved distribution. Formally, I assume that, before imposing feasibility, the error process has a symmetric distribution with mean, median and mode 0 that assigns decreasing probability to extreme values, that is, the probability density function is decreasing in the absolute value of the error.

11 Alternatively one can consider the truncation of the distribution of the error process to its domain. The assumption adopted in this chapter is consistent with a fixed latent error process. The analysis could be carried out with this alternative assumption, but special attention should be paid to the estimation of the error process.

Assumption 1 (Unconstrained Error Process) The unconstrained process $\varepsilon^j \in \mathbb{R}^L$ is assumed to be i.i.d. with continuous and symmetric p.d.f. with mean, median and mode $0_L$, and such that $\nabla_\varepsilon f(\varepsilon^j)^T \operatorname{Diag}(\varepsilon^j) < 0$ for all $\varepsilon^j \ne 0$ and $\varepsilon \perp\!\!\!\perp m$, for all $m \in M(\mathbf{B}(p, x))$ and for all $j = 1, \dots, J$.

The stochastic extension is defined for the domain where the empirical implications of the model are defined. In doing so, it allows the recovery of a "candidate model" in the budget sets faced by the decision maker and, from it, a natural construction of the predictive distribution for choices, extending the "supporting set" to reflect noise. I call this extension $M_\varepsilon$.

Extending the model to allow for error induces a new source of uncertainty when predicting behavior; I call this uncertainty "error uncertainty".12 Therefore, I extend the "supporting set" (Definition 1.5) to reflect this new source of uncertainty. First, I recover the revealed preference information from data by way of a minimum distance estimator.

12 In this chapter I abstract from uncertainty with respect to the systematic component that generated the data. When predicting behavior for an axiomatic model three sources of uncertainty arise: (i) "error uncertainty" as described above; (ii) "model uncertainty", since for a given sequence of rational choices the model yields a multiple prediction; and (iii) "DGP uncertainty", which refers to the uncertainty about the process that generated the data. For the main specification of the predictive distribution in this chapter I fix the data generating process to be the most likely one given data. In Section 1.4.2 I propose a model averaging alternative to account for uncertainty on the DGP.

Definition 1.9 (Projection of observed choices onto the model) The problem is to find the projection $\hat{m}_{MD}$ of $q$ onto $\mathcal{M}$, the minimum distance estimator, as the solution to

$$\hat{m}_{MD} \equiv \arg\min_m \|vec(q) - vec(m)\| \quad \text{s.t. } m \in M(\mathbf{B})$$

The estimation defined above cannot be performed with standard mathematical tools due to the topology of the set of sequences of choices that are consistent with the model. For GARP, Kocoska (2012) proposes a series of derivative-free algorithms to find the "closest" rational sequence of choices to observed behavior, in the economic environments in which choices were observed. This projection requires minimizing the distance between observed behavior and rational choices, for each possible sequence of rational choices, that is, for each possible preference order.13

This projection yields an estimate for the "candidate model" and for the error process.14 Next, I construct the "supporting set" for the estimated "candidate model". As discussed before, its size reflects the "model uncertainty". Finally I compute the predictive distribution by extending this construction to reflect the estimated error process.

13 Note that the result of this projection depends on the units of measurement. However, the projection technique is flexible enough to allow for the inclusion of weights; therefore one could weight by prices, or by the ratio of prices to income, to remove the dependency on the units of measurement.

14 In Section 1.3.4 I show an algorithm to recover the underlying distribution of the error process even for small samples.

Figure 1.7: Predictive Distribution - Intuition. (a) The expected distribution of choices. (b) The predictive distribution. The predictive distribution in the case of only one dimension.

Given Assumption 1, the "candidate model" is a vector of choices that most likely generated observed choices, given data. For $\hat{m}_{MD}$, the estimated "candidate model", each distinct sequence of choices $(\hat{m}_{MD}, s^0)$ with $s^0 \in S^0(\hat{m}_{MD})$ represents a distinct behavioral functional among the ones consistent with model $M$. Expected choices conditional on $(\hat{m}_{MD}, s^0)$ are given by the distribution of $s^0 + \hat{\varepsilon} \mid s^0$, where the distribution of $\hat{\varepsilon} \mid s^0$ is estimated from the residuals of the projection. Figure 1.7 shows the intuition for the predictive distribution conditional on $s^0$. Thus, I define the predictive distribution as the distribution of $s + \hat{\varepsilon} \mid s$ with $s \sim F_{S^0(\hat{m}_{MD})}$. Without imposing further assumptions,15 all $s^0 \in S^0(\hat{m}_{MD})$ are equally likely; therefore $F_{S^0(\hat{m}_{MD})}$ is assumed to be uniform on the supporting set. Figure 1.7b shows the expected distribution conditional on different choices from the "supporting set". The predictive distribution is the average of all such distributions.

15 If the researcher has prior information available, this definition can be extended to reflect this prior information. In particular, it will no longer be the case that all behavioral functionals constructed with different choices from the "supporting set" are equally likely. For a given prior, the distribution of choices on the supporting set is given by the truncation of the prior to the "supporting set".

Definition 1.10 (Predictive Distribution for $B^0$ given MD) Let observed choices $q$ be generated by a systematic component $m \in \mathcal{M}$ and an error process $\varepsilon$ consistent with Assumption 1, such that

$$q = m + \tilde{\varepsilon} \quad \text{such that } m \in M(\mathbf{B}(p, x)) \text{ and } q \in \mathbf{B}(p, x),$$

where $\tilde{\varepsilon}$ is the truncation of the error process to satisfy feasibility constraints, consistent with equation 1.2. Let $\hat{m}_D$ be a consistent estimator of $m$, and define $\hat{\varepsilon} = q - \hat{m}_D$. Then, I define the predictive distribution as the distribution of the random variable

$$S + \hat{\varepsilon}(S) \quad \text{with } S \sim U_{S^0(\hat{m}_D)} \text{ and } \hat{\varepsilon}(S) \sim F_{\hat{\varepsilon}^0 \mid S},$$

where $F_{\hat{\varepsilon}^0 \mid S}$ is the truncation of the estimated distribution of the residuals consistent with Assumption 1 such that $S + \hat{\varepsilon}(S) \in B^0$, and $U_{S^0(\hat{m}_D)}$ is a uniform distribution over the supporting set. Let $f^0_{MD}$ be the predictive density; then

$$f^0_{MD}(y) = \frac{1}{\int_{s \in S^0(\hat{m}_D)} ds} \int_{s \in S^0(\hat{m}_D)} \left[ \int_{\eta : p^0 \cdot \eta = p^0 \cdot s} \mathbb{1}\left\{ y = \arg\min_{\tilde{y} \in B^0} \|\tilde{y} - \eta\| \right\} f_{\varepsilon^0}(\eta - s) \, d\eta \right] ds,$$

where $f_{\varepsilon^0}(\xi) \propto \mathbb{1}\{p^0 \cdot \xi = 0\} f_\varepsilon(\xi)$.

Definition 1.10 combines a (non-standard) frequentist projection tool with a quasi-Bayesian16 model averaging technique to construct the predictive distribution. By definition, each distinct alternative from the "supporting set" represents an observationally distinct behavior among the ones consistent with the model. Then, the predictive distribution is constructed as the average distribution across all these behaviors, where the weights are given by their relative likelihood, following the intuition of BMA. The following example illustrates this construction.

16 A fully Bayesian approach would require the definition of a suitable prior over the set of well-behaved preference orders and a mechanism for updating; this is left for future research.

Figure 1.8: Relation between empirical consistency, falsifiability and predictive ability. (a) Subject A. (b) Subject B. Empirical consistency (fit) is important but does not imply usefulness of the model. Even when falsifiable, the model may not restrict behavior enough to allow for learning about underlying behavior.

Example (Continued 1.4) The predictive distributions for the decision makers A and B
discussed in Example 1.4 are depicted in Figure 1.8. For consumer A the projection is equal to observed choices, i.e. $\hat{m}^A = q^A$; then estimated errors are 0. Therefore, the predictive distribution is given by a uniform distribution on the "supporting set" $S^0(\hat{m}^A) = B^0$. On the other hand, choices observed for consumer B are not consistent with GARP. The closest rational sequence of choices to observed data is $\hat{m}_1^B = q_1^B$ and $\hat{m}_2^B = (5/9, 20/9)$, thus $\hat{\varepsilon}_1^B = (0, 0)$ and $\hat{\varepsilon}_2^B = (-1/18, 1/36)$. The "supporting set" is then defined by $S^0(\hat{m}^B) = \{q \in \mathbb{R}^2_{++} : q_1 + q_2 = 3.5 \text{ and } 3/8 \le q_1 \le 2\}$. The resultant predictive distributions confirm the intuition that the data for decision maker A, though consistent with the model, deliver less informative predictions than the data for decision maker B. Figure 1.9 shows the predictive density for $q_1^A$ and $q_1^B$ over $B^0$.

Figure 1.9: Predictive densities for Example 1.4. The predictive densities for good 1 in $B^0$ constructed from the information in Example 1.4 following Definition 1.10.

The predictive distribution for model $M$ given observed data $D$ can be understood as the expected distribution of choices given data $D$ as prescribed by model $M$. Its variance reflects the uncertainty of predicting behavior due to the error process ("error uncertainty") and the systematic component ("model uncertainty"). By construction, this distribution extends the "supporting set" to allow for error, and coincides with it if observed behavior is perfectly consistent with the model, in which case the predictive distribution is given by a uniform distribution over the "supporting set". If the "supporting set" converges to a point under a sufficient information condition on the distribution of the budget sets, as holds for GARP, then the predictive distribution converges to the distribution of the error process centered at the point prediction of the model. These properties are formalized in Theorem 1.11.
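The construction in Definition 1.10 can be approximated by simulation in the two-good case. The sketch below is only illustrative and rests on several assumptions that are not part of the text: the error is taken to be Gaussian along the budget line with a hypothetical scale `sigma`, the "supporting set" is summarized by an interval for good 1 (here the interval from 3/8 to 2 reported for consumer B), and the projection in equation 1.2 is implemented by clipping to the ends of the budget line, which is valid only with two goods. The function and variable names are hypothetical.

```python
import random
import statistics


def draw_predictive(support, p0, x0, sigma, n_draws=5000, seed=0):
    """Monte Carlo draws of S + error, projected back onto the budget line of B0."""
    rng = random.Random(seed)
    low, high = support                 # supporting set as an interval for good 1
    q1_max = x0 / p0[0]                 # corner of the budget line on the good 1 axis
    norm = (p0[0] ** 2 + p0[1] ** 2) ** 0.5
    step = p0[1] / norm                 # good 1 component of a unit move along the line
    draws = []
    for _ in range(n_draws):
        s1 = rng.uniform(low, high)     # S uniform on the supporting set
        shock = rng.gauss(0.0, sigma)   # error restricted to the budget hyperplane
        y1 = min(max(s1 + shock * step, 0.0), q1_max)  # nearest feasible point
        draws.append((y1, (x0 - p0[0] * y1) / p0[1]))
    return draws


support_b = (3 / 8, 2.0)
no_error = draw_predictive(support_b, (1.0, 1.0), 3.5, sigma=0.0)
with_error = draw_predictive(support_b, (1.0, 1.0), 3.5, sigma=0.5)
spread_no_error = statistics.pvariance([y[0] for y in no_error])
spread_with_error = statistics.pvariance([y[0] for y in with_error])
print(spread_no_error, spread_with_error)
```

With `sigma=0.0` the draws are uniform on the supporting interval, which is the case of perfectly consistent behavior. A positive `sigma` spreads the draws beyond the interval while keeping them on the budget line, so the dispersion reflects both "model uncertainty" and "error uncertainty".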
Theorem 1.11 (Properties of the Predictive Distribution) Let $f^0_{MD}$ be the predictive density for $B^0$ based on the extended model $M^{\varepsilon}$, conditional on data $D = \{p^j, x^j, q^j\}_{j=1}^{J}$, constructed as in Definition 1.10. Let $Y^0_{MD}$ be a random variable with probability density $f^0_{MD}$.

1. Reflects recovered error process. If $S^0(\hat m_{MD-A}) = S^0(\hat m_{MD-B})$ and $\|vec(\hat\varepsilon_A)\| > \|vec(\hat\varepsilon_B)\|$ then
$$ Diag\left(Var\left(Y^0_{MD-A} \mid Y^0_{MD-A} \in B(\bar s, \delta)\right)\right) \ge Diag\left(Var\left(Y^0_{MD-B} \mid Y^0_{MD-B} \in B(\bar s, \delta)\right)\right) $$
for all $\delta > 0$ such that $B(\bar s, \delta) \subset B^0$, where $\bar s = \int_{S^0(\hat m_{MD-A})} s\, ds \,\big/ \int_{S^0(\hat m_{MD-A})} ds$.

2. Reflects the tightness of identification. If $\|vec(\hat\varepsilon_A)\| = \|vec(\hat\varepsilon_B)\|$ and $S^0(\hat m_{MD-A}) \supset S^0(\hat m_{MD-B})$ then
$$ Diag\left(Var\left(Y^0_{MD-A} \mid Y^0_{MD-A} \gg 0\right)\right) > Diag\left(Var\left(Y^0_{MD-B} \mid Y^0_{MD-B} \gg 0\right)\right) $$

3. Limiting behavior. For an increasing sequence of budget sets $C^n$ such that $C^n \subset C^{n+1} \subset \dots$ and $\bigcup_n \{C^n\}$ is dense in $\mathbb{R}^{L \times n}_{++}$, with $q^n \in B(C^n)$ such that $q^n \gg 0$ and $S^0(\hat m^n_{MD}) \to m^0$, then $F^0_{MD} \to m^0 + F_{e^0|m^0}$, where $F_{e^0}$ is the distribution of the underlying error process centered in $m^0$ truncated to $B^0$.

4. Varian. If $q \in M$ then $Y^0_{MD} \sim U_{S^0(q)}$, where $U_{S^0(q)}$ is a uniform distribution on $S^0(q)$.

Unless otherwise stated, all proofs are relegated to Appendix 1.A. The effect of the number of observations on the predictive distribution is, in principle, ambiguous. An additional observation potentially enhances predictive precision due to the shrinkage of the supporting set (Proposition 1.8). However, as the identification of the systematic component gets tighter, behavior previously consistent with the model may no longer be consistent, potentially increasing the variance of the error process and therefore the dispersion of the predictive distribution.
This construction provides a predictive distribution that (i) is of interest for a researcher seeking to use the nonparametric model as a way of predicting behavior, as shown in Section 1.4.1; and (ii) allows for the assessment of predictive ability, as shown in Sections 1.4.2 and 1.4.3.

Figure 1.10: Histogram of Projected-SSR / Generated-SSR for J = 10, 25, 50, 100, and 100 repetitions. Choices were generated from a Cobb-Douglas utility function $u(x, y) = x y$ plus $\varepsilon$ consistent with the truncation defined in Section 1.3.1, where the underlying error process is $\varepsilon \sim N(0, 5)$. $B(p, x)$ was generated such that $p_1 \sim U[5, 10]$, $p_{21} \sim U[.5, 2]$ and $w \sim U[80, 120]$.

1.3.4 On the Estimation of the Error Process

Under some conditions discussed above, GARP is point identified. Therefore, in the limit Definition 1.9 provides a consistent estimator of $m$ and consequently of the distribution of $\varepsilon$. Unfortunately, small samples are the norm rather than the exception when testing axiomatic models of behavior.$^{17}$

17 Without imposing homogeneity assumptions the relevant dimension is the number of observations per decision making unit. Many data sets have information available for a large number of subjects, but cross-sectional studies have been shown to fail when accounting for heterogeneity in behavior, see (Lewbel 2001).

For any sample size, the residuals from the projection provide the relevant measure of the departures from the model. Nevertheless, the construction of the predictive distribution requires a consistent estimate of the underlying error process.$^{18}$ For small samples, the recovered residuals underestimate the underlying error process. Consider Figure 1.10, where choices were generated in the context of the rationality model with $\varepsilon \sim N(0, 5)$ and J = 10, 25, 50, 100. The underestimation of the residual process is significant for all considered sample sizes. The gap between $\|\varepsilon\|$ and $\|\hat\varepsilon\|$ is jointly defined by the error process and the sample size, given $B(p, x)$.

18 Note that the predictive distribution here is constructed by incorporating the error process for each element in the "supporting set". Alternatively, one could construct the prediction as the "supporting set" allowing for error in its boundary. If the researcher were to follow this alternative approach, then the relevant error process is the recovered error process.

In this section I propose a pseudo-parametric approach for the estimation of the error process. The approach remains nonparametric in the identification of the systematic component but imposes a parametric assumption on the error process consistent with Assumption 1. Let $F_{\varepsilon\theta}$ be the distribution of the error process before the truncation, parametrized by $\theta$. For example, if $F_{\varepsilon\theta} = N(\mu, \sigma^2)$, then $\theta \equiv (\mu, \sigma^2)$. The problem then reduces to consistently estimating $\theta$.

Relying on the structure of GARP, I propose two algorithms to simulate the distribution $F_{\varepsilon\theta}$ without computing the projection in each iteration.$^{19}$ The first approach constructs the consistent set $M$ given $D$ by intersecting, in a sequential manner, the supporting set for each budget set. For each $i$ chosen at random from the set of observations $\mathcal{J} \equiv \{1, \dots, J\}$, it first constructs $S^i(\hat m_{-i})$. It then chooses $k \in \mathcal{J} \setminus \{i\}$ and constructs $S^k(\hat m_{-\{i,k\}})$, and continues until $\mathcal{J}$ is exhausted. The results of this algorithm in a simulation setting are presented in Table 1.4. They show that, even for a small sample of J = 40, the algorithm significantly improves the estimation of the underlying error process with respect to a nonparametric estimation of the distribution of the residuals of the projection. The second approach exploits the symmetry of $f_{\varepsilon\theta}$ (Assumption 1) to estimate $\theta$ as the parameters that maximize the hits to the set of rational choices from observed choices. These algorithms are formally defined in Appendix 1.B.1.

19 Ideally $\hat\theta$ can be estimated from the simulation of the effective (observed) error process for an exhaustive grid on $\theta$, that is $\hat\theta = \arg\min_{\theta_0} \|F_{e\theta_0} - F_e\|$, where $F_e$ is the observed distribution of the error process and $F_{e\theta}$ is the simulated effective distribution of the error process for $\theta = \theta_0$. Estimating $\hat\theta$ in this manner implies computing the projection of simulated data to the model for each iteration in the simulation and for each possible value of $\theta$, which is computationally very expensive.

u(·)                      σ = 0.5   σ = 1    σ = 2    σ = 3    σ = 5    σ = 10
$x_1^{1/4} x_2^{3/4}$     0.9040    0.7618   0.4052   0.6251   0.9218   0.9808
$x_1^{1/2} x_2^{1/2}$     0.9570    0.8816   0.5528   0.5149   0.8918   0.9794

Table 1.4: Simulations for the Algorithms. Kolmogorov-Smirnov two tailed test statistics with respect to measure constructed based on the projection for σ = 3 for two alternative distributions over a CD utility function.

1.4 Predictive Performance Analysis

1.4.1 Decision Making and the Predictive Distribution

The predictive distribution is a sufficient statistic for a researcher who seeks to predict behavior based on the model. Consider a decision problem where the utility of the decision maker, $v(\cdot)$, depends on her own decision, $a$, and on the behavior of the consumer (or consumers), $q$, which could depend on $a$. The decision maker models the consumer's decision by a model $M$. Given observed data $D$, the expected distribution of choices for a new economic environment, $B^0$, conditional on $a$ and $M$, is the predictive distribution, $F^0_{MD}(q|a)$. Knowledge of the predictive distribution allows the decision maker to compute her optimal choice, $a^*|M = \arg\max_a E_{F^0_{MD}(q|a)}[v(a, q(a))]$, and the expected value of her decision, $v^*|M = E_{F^0_{MD}(q|a^*)}[v(a^*|M, q(a^*|M))]$. The expected value of the problem can be interpreted as the negative of the loss due to the uncertainty when predicting behavior by assuming the model $M$.
Under standard concavity assumptions on the preferences of the decision maker, if consumers' behavior affects the optimal choice, then higher uncertainty in predicting behavior translates into lower expected utility. To illustrate the relevance of the predictive distribution in decision making, consider the following example.

Example 1.12 (GARP vs Cobb-Douglas) The producer of good 1 in this economy wants to project her revenue for $B^0$. Her utility function is $v : \mathbb{R} \to \mathbb{R}_+$. For simplicity assume that there is only one consumer, whose choices maximize her utility function $u = \min\{2x, 3y\}$ without error, but this is unknown to the researcher. Choices are observed for the same budget sets as in Example 2.3. The researcher is considering two alternative models of behavior: GARP (model I) and a Cobb-Douglas utility function (model II). Figure 1.11 shows the predictive distribution for both models. Even if empirically inaccurate, the more specific model (model II) provides more information to predict behavior. Specifically, for GARP the revenue is uniformly distributed in [0.38, 3.45], with expected revenue 1.915 and variance 0.79. On the other hand, the Cobb-Douglas model yields a predictive distribution with two mass points in {1.16, 2.52}. For model II the expected revenue is 1.84 while the variance is 0.4624. The best model for the decision maker depends on her preferences; that is, the best model is the one that maximizes her expected utility, i.e. $\int_{supp(q^0_{MD})} v(p^0 \cdot \tilde q)\, f^0_{MD}(\tilde q)\, d\tilde q$. If $v$ = Revenue then she prefers GARP, while if the utility is concave in revenue, she may prefer model II.

Given a set of candidate models, the decision maker can exploit the information provided by all models to construct an average predictive distribution, where the weight of each candidate model is given by its marginal likelihood. This construction is discussed in Section 1.4.2.
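The revenue figures reported in Example 1.12 can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated in the example: the two mass points of model II are given equal weight (this reproduces the reported mean and variance), and the concave utility is taken to be $v(R) = \ln R$, a purely illustrative choice.

```python
import math

# Model I (GARP): revenue uniform on [a, b], as reported in Example 1.12.
a, b = 0.38, 3.45
mean_I = (a + b) / 2
var_I = (b - a) ** 2 / 12

# Model II (Cobb-Douglas): two mass points; equal weights are an assumption.
points, weights = [1.16, 2.52], [0.5, 0.5]
mean_II = sum(w * r for w, r in zip(weights, points))
var_II = sum(w * (r - mean_II) ** 2 for w, r in zip(weights, points))

def expected_log_uniform(lo, hi):
    """Closed form of E[ln R] for R ~ U[lo, hi] with lo > 0."""
    return (hi * math.log(hi) - hi - lo * math.log(lo) + lo) / (hi - lo)

# Hypothetical concave utility v(R) = ln R.
eu_log_I = expected_log_uniform(a, b)
eu_log_II = sum(w * math.log(r) for w, r in zip(weights, points))

print(mean_I, round(var_I, 2))    # expected revenue and variance, model I
print(mean_II, round(var_II, 4))  # expected revenue and variance, model II
print(eu_log_I < eu_log_II)       # ranking under the illustrative log utility
```

Under linear utility the comparison reduces to expected revenue, which favors model I; under the illustrative log utility the lower dispersion of model II more than offsets its lower mean, which is the sense in which a decision maker with concave utility may prefer model II.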
If the decision maker seeks to choose one model from the set of candidates, she chooses the model that maximizes her expected utility (or minimizes the expected loss). The optimal model depends on the preferences of the decision maker, $v(\cdot)$, the particular application, $B^0$, and the available data, $D$. Conditional on $B^0$ and $D$, sufficient conditions have been proposed in the literature for a model to be better than others in terms of predictive information, see (Blackwell et al. 1951), (Cabrales, Gossner, and Serrano 2010). Such conditions cannot be generalized for any $v(\cdot)$, and do not provide a complete order. To allow for a complete comparison, in Section 1.4.3 I provide a series of statistics that summarize the predictive information extracted from data $D$ given the model $M$.

Figure 1.11: Relation between empirical consistency, falsifiability and predictive ability. Panels: (a) Model I, (b) Model II.

1.4.2 Marginal Likelihood and Model Averaging

Given a set of $k$ candidate models, $M_1, \dots, M_k$, the researcher interested in predicting behavior can use the predictive distribution to compute weights for model averaging. The model averaging predictive distribution (MAPD) is defined as the average of the predictive distributions of the candidate models, each constructed as in Definition 1.10, where the weights are constructed as the relative likelihood of each of these models. The MAPD induces a new source of uncertainty when predicting behavior, "DGP uncertainty", which refers to the fact that there are many plausible models that may be generating observed behavior. I follow the intuition of leave-one-out (LOO) cross validation to compute the LOO marginal likelihood.$^{20}$ For each observation $j \in \{1, \dots, J\}$ the predictive distribution over $B^j$ is constructed with the observations excluding $j$, as in Definition 1.10.
The likelihood of correctly predicting behavior can be computed as

$$ p\left(q^j \mid MD_{-j}\right) = f^j_{MD_{-j}}(q^j) $$

where $f^j_{MD_{-j}}$ is the probability density function of the predictive distribution $F^j_{MD_{-j}}$ as defined in 1.10 for data $D$ from which observation $j$ is excluded. This probability provides a measure of the predictive performance of the model in sample; the better the model, the more accurate the predictions it delivers and the higher the probability of correctly predicting observed behavior.$^{21}$

Definition 1.13 (Model Evidence ("Marginal Likelihood")) Let $F^j_{MD_{-j}}$ be the predictive distribution as in Definition 1.10 constructed for $B^j$ from data $D$ from which observation $j$ is excluded, and let $f^j_{MD_{-j}}$ be its probability density function. Then,

$$ p\left(q \mid M^{\varepsilon}\right) = \prod_{j=1}^{J} f^j_{MD_{-j}}(q^j) $$

20 By construction, the "posterior marginal likelihood" does not reflect "model uncertainty" since, if I were to include $q^j$ when constructing the predictive distribution for $B^j$, the supporting set collapses to $\hat m^j$ and the predictive distribution collapses to the distribution of the error process. To reflect the true uncertainty for predicting behavior I proceed with the LOO predictive distribution.

21 An alternative measure of the predictive performance in sample can be constructed from Definition 1.15. For observation $j$ define $C^j(q^j)$ as the biggest HPD set that does not contain $q^j$; then $1 - P(C^j(q^j))$ is the confidence at which the researcher can be certain that the interval prediction for the model contains observed data.

Given a prior distribution on the set of candidate models, $\pi : (M_1, \dots, M_k) \to [0, 1]$, the posterior probability of the model can be computed by

$$ p\left(M^{\varepsilon}_k \mid D\right) \propto p\left(q \mid M^{\varepsilon}_k\right) \times \pi(M_k) $$

Then, the average predictive distribution is defined as the weighted average of the predictive distributions for all candidate models.
Formally,

$$ f^0_{BMA(M_1, \dots, M_k)D}(y) = \sum_{i=1}^{k} p\left(M^{\varepsilon}_i \mid D\right) f^0_{M_i D}(y) $$

1.4.3 Measures of Predictive Ability

Predictive Model Assessment and Model Selection

The marginal likelihood as in Definition 1.13 can be understood as the probability of correctly predicting observed behavior under the extended model, when the prediction is performed considering the other $J - 1$ observations. As constructed, even if slightly different from the standard definition$^{22}$, the marginal likelihood provides a measure of the evidence for the model. If the decision maker seeks to choose one model from the set of candidate models, this can be done from the marginal likelihood. The Bayes factor provides a summary of the evidence provided by the data in favor of one model versus the other.$^{23}$

$$ \text{Bayes Factor}_{i,k} = \frac{p\left(q \mid M^{\varepsilon}_i\right)}{p\left(q \mid M^{\varepsilon}_k\right)} $$

22 Note that if we were to include the observation $j$ when computing $p(q^j \mid M^{\varepsilon}_k)$, this collapses to the distribution of the error process around the "candidate model"; that is,

Definition 1.14 (Model Evidence ("Posterior Marginal Likelihood")) Let $\hat m_D$ be a consistent estimator of $m$ as in Definition 1.10, and let $f_{\hat\varepsilon}$ be the estimation of the underlying error process from $\hat\varepsilon = q - \hat m_D$. Then,

$$ p\left(q \mid M^{\varepsilon}; m = \hat m_D\right) = \prod_{j=1}^{J} \left\{ 1\left(q^j \gg 0\right) f_{\hat\varepsilon}\left(q^j - \hat m^j_D\right) + \left(1 - 1\left(q^j \gg 0\right)\right) \int_{\tilde q^j :\, q^j \equiv \arg\min_{a \in B^j} \|\tilde q^j - a\|} f_{\hat\varepsilon}\left(\tilde q^j - \hat m^j_D\right) d\tilde q^j \right\} $$

The critical value to decide for one model over the other depends on the loss function. For interpretation it can be useful to consider twice the logarithm of the Bayes factor, since it has the same structure as the familiar deviance and likelihood ratio test statistics.$^{24}$ Alternatively, the logarithm of the marginal likelihood of the data can be interpreted as a predictive score; the logarithm of the Bayes factor can then be interpreted as the difference in the predictive scores of the considered models.

Measures of Predictive Precision

Given a data set $D$, a more informative model delivers tighter predictions.
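The model evidence of Definition 1.13, the posterior model weights, the averaged density and the Bayes factor can be sketched numerically as follows. This is a minimal illustration, not the implementation used in this chapter: the leave-one-out densities `loo_I` and `loo_II` are hypothetical numbers standing in for $f^j_{MD_{-j}}(q^j)$, and all function names are assumptions introduced for the example.

```python
import math

def log_evidence(loo_densities):
    """Log of the LOO 'marginal likelihood': sum over j of log f^j_{MD-j}(q^j)."""
    return sum(math.log(f) for f in loo_densities)

def posterior_weights(log_evidences, priors):
    """p(M_k | D) proportional to evidence_k * prior_k, normalised to sum to one."""
    logs = [le + math.log(pr) for le, pr in zip(log_evidences, priors)]
    top = max(logs)                          # subtract the max for numerical stability
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def averaged_density(weights, densities_at_y):
    """Model-averaged predictive density at a point y."""
    return sum(w * f for w, f in zip(weights, densities_at_y))

def two_log_bayes_factor(le_i, le_k):
    """Twice the log Bayes factor of model i against model k."""
    return 2.0 * (le_i - le_k)

# Hypothetical LOO predictive densities evaluated at the observed choices.
loo_I = [0.30, 0.25, 0.40, 0.35]
loo_II = [0.60, 0.05, 0.70, 0.10]

le = [log_evidence(loo_I), log_evidence(loo_II)]
w = posterior_weights(le, priors=[0.5, 0.5])
print(w)                                  # posterior model probabilities
print(two_log_bayes_factor(le[0], le[1])) # evidence for model I against model II
print(averaged_density(w, [0.2, 0.9]))    # averaged density at some point y
```

Working on the log scale avoids underflow when $J$ is large, since the evidence is a product of $J$ densities.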
For a given model $M$, a data set $D_A$ provides more information than another data set $D_B$ if it allows for the construction of more informative predictive distributions. A natural way to measure the performance of an MD-program is therefore to measure the dispersion of its predictive distribution. The two summary statistics I propose are: (i) the size of the smallest α-level confidence interval and (ii) the Kullback-Leibler divergence measure with respect to an uninformative prior. The former follows the natural intuition that more useful models deliver "narrower" confidence sets, while the latter collapses to a measure of the entropy of the predictive distribution; better models reduce the uncertainty in predicting behavior.

23 For an extensive analysis of Bayes factors and their interpretation refer to (Kass and Raftery 1995).

24 Its interpretation as the likelihood ratio test depends on defining one of the models as the null hypothesis and the other as the alternative.

Size of the Smallest α-level Confidence/Credible Interval. The performance of the MD program can be gauged by sizing the certainty with which behavior can be predicted by imposing the model $M$ given observed data $D$. I measure it by the size of the smallest α-level confidence/credible connected set, that is, the highest posterior density set among all connected subsets of $B^0$. The smaller the α-level HPD interval (set), the more precise the predictions delivered by model $M$ given observed data $D$. I define the predictive precision measure as the relative size of the complement of the HPD interval. Formally,

Definition 1.15 (Predictive Precision - Credible Set) Let $C^0_\alpha$ be the $100(1-\alpha)\%$ highest posterior density (HPD) connected set for the predictive distribution $F^0$ constructed given model $M$.
That is,

$$ C^0_\alpha \equiv \arg\min_{C^\alpha} \left\{ |C^\alpha| : C^\alpha \text{ is connected},\ P_{F^0}\left(q \in C^\alpha\right) = 1 - \alpha \right\} $$

Conditional on data $\{q^j, p^j\}_{j=1}^{J}$, the predictive precision α measure for budget set $B^0$ is defined by

$$ PP^0_\alpha|MD \equiv 1 - P\left(q^0 \in C^0_\alpha\right) \qquad (1.3) $$

where $P = U_{B^0}$ is the uniform distribution over the budget set $B^0$.

The smaller $C^0_\alpha$, the bigger its complement $(C^0_\alpha)^c$. Under (Becker 1962)'s model of irrational behavior, there is nothing that can be learned from data, therefore $PP^0_\alpha|Becker = \alpha$.$^{25}$ Then, $PP^0_\alpha|MD - \alpha$ can be interpreted as the improvement in the precision of predicting behavior by using the model $M$ with respect to not using any model at all.

25 This is not the same as data being generated by this alternative data generating process and then imposing GARP as a model. This measure can be constructed to provide further information to the researcher about the power and significance of the measure.

Information theory based measures. An alternative way to measure the usefulness of the MD-program is by sizing the gain in information to predict behavior by assuming model $M$ given data $D$ with respect to an uninformative prior. Let $F^0$ be the predictive distribution constructed for $B^0$ given model $M$, and let $P$ be a uniform distribution over $B^0$, $P = U_{B^0}$.$^{26}$ I propose to measure the predictive ability of the model by the "distance" between these distributions. The proposed "metric" in this chapter is the Kullback-Leibler divergence measure$^{27}$; while not a metric in the formal sense, the KL-divergence measure provides an intuitive measure of the gain in information from imposing the model.

26 This measure can be constructed for any other probability measure that the researcher considers pertinent; the results and properties are qualitatively the same.

Definition 1.16 (Predictive precision index - Information theory) Let $F^0$ be the predictive distribution constructed for $B^0$ given model $M$ and data $D$ and let $P = U_{B^0}$.
Then,

$$ PP^0_{KL}|MD \equiv D_{KL}\left(F^0 \,\|\, P\right) \qquad (1.4) $$

The connection between the Kullback-Leibler divergence measure and Shannon entropy has been well established in the literature:

$$ D_{KL}\left(F^0 \,\|\, P\right) = -H(f^0) - E_{F^0}(\ln p) $$

Since $P = U_{B^0}$, $E_{F^0}(\ln p) = \ln \frac{1}{n} = -\ln n$ where $n = \int_{B^0} dx$, and therefore

$$ PP^0_{KL}|MD = \ln n - H\left(f^0\right) $$

where $H(\cdot)$ is the Shannon entropy and $f^0$ is the density function of $F^0$. The entropy of the predictive distribution is the uncertainty of predicting an outcome by using it. Maximizing this measure is therefore equivalent to minimizing the entropy of the predictive distribution.

27 If the researcher is interested in an alternative measure, e.g. Hellinger or Total Variation, all the results in this chapter follow through.

Comparison of MD programs

An intuitive way to make the comparison across programs is based on the predictive precision measures. In order to use these measures independently of the selection of $B^0$, I propose a "leave-one-out" (LOO) estimation.$^{28}$ The leave-one-out predictive precision measure is constructed as an average, across observations, of the predictive precision measures constructed for $B^j$ for model $M$ and data $D_{-j}$, i.e. data $D$ where observation $j$ is excluded. This average can be constructed as a simple arithmetic or geometric mean, weighted mean, median, or any other summary statistic that the researcher considers relevant.

Definition 1.17 (Leave-one-out Predictive Precision ($PP^{LOO}$)) The leave-one-out predictive precision measure for program MD is defined as

$$ PP^{LOO}_{MD} = \mathcal{M}\left( \left\{ PP^j_{-j} \right\}_{j=1}^{J} \right) $$

where $\mathcal{M}$ is an average summary statistic, $\mathcal{M}(\cdot) \in \{\text{mean}, \text{median}, \dots\}$, and $PP^j_{-j} = PP^j_{M\{D \setminus D^j\}}$.

The leave-one-out predictive precision removes the arbitrariness in the selection of budget sets while allowing the researcher to measure the tightness of the identification of underlying behavior for the observed budget sets. Section 1.4.4 shows the relationship between this approach and the power literature.
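Both predictive precision measures can be sketched for a predictive distribution discretized on a one-dimensional grid over $B^0$ (for instance, the quantity of good 1 along the budget line). The code below is an illustration under that discretization assumption, with hypothetical function names; the smallest connected set is found with a sliding window over adjacent grid cells, and the KL measure uses the identity $PP^0_{KL} = \ln n - H(f^0)$ with $n$ equal to the number of cells.

```python
import math

def normalise(mass):
    """Scale non-negative cell masses so that they sum to one."""
    total = sum(mass)
    return [m / total for m in mass]

def pp_kl(mass):
    """KL divergence from the uniform distribution on the same n cells: ln n - H(f)."""
    n = len(mass)
    entropy = -sum(m * math.log(m) for m in mass if m > 0)
    return math.log(n) - entropy

def pp_alpha(mass, alpha):
    """1 - relative size of the smallest connected set of cells with mass >= 1 - alpha."""
    n = len(mass)
    target = 1.0 - alpha - 1e-12            # small tolerance for rounding error
    best, left, running = n, 0, 0.0
    for right in range(n):
        running += mass[right]
        # drop cells on the left while the window still covers the target mass
        while left < right and running - mass[left] >= target:
            running -= mass[left]
            left += 1
        if running >= target:
            best = min(best, right - left + 1)
    return 1.0 - best / n

uniform = normalise([1.0] * 100)                                   # Becker benchmark
peaked = normalise([math.exp(-0.5 * ((i - 50) / 5.0) ** 2) for i in range(100)])

print(pp_alpha(uniform, 0.05), pp_kl(uniform))  # close to alpha and to zero
print(pp_alpha(peaked, 0.05), pp_kl(peaked))    # both larger for the tighter density
```

For the uniform benchmark the sketch returns a precision of approximately $\alpha$ and a divergence of approximately zero, in line with the interpretation of $PP^0_\alpha|MD - \alpha$ as the gain over not using any model.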
Let $PP^0|MD \in \left\{ PP^0_\alpha|MD,\ PP^0_{KL}|MD \right\}$. The proposed measures inherit the desirable properties of the predictive distribution, favoring more informative predictive distributions. Let $\succ^0_P$ be the linear order induced by $PP^0|MD$ on programs; that is, $M_A D_A$ is said to be more informative than $M_B D_B$, i.e. $M_A D_A \succ^0_P M_B D_B$, if and only if $PP^0|M_A D_A > PP^0|M_B D_B$.

28 Alternatively, a "grid" simulation over a representative sequence of budget sets can be constructed. Computing the predictive precision measures for a sequence of budget sets and summarizing these provides the researcher with an informative measure of the predictive ability of the model given observed behavior. But in order to remove any dependency on the selection of the particular sequence of budget sets, this sequence needs to be exhaustive enough, incurring additional computational costs.

Definition 1.18 (Information order) The information order, $\succ^0_{PP}$, is defined by the partial order induced by the measure $PP^0|MD$ on programs. That is,

$$ M_A D_A \succ^0_P M_B D_B \iff PP^0|M_A D_A > PP^0|M_B D_B $$

Conditional on data $D$, the information order defines a complete order on the set of models. A model $M$ is said to be predictive-wise more informative than $\tilde M$ for $B^0$, conditional on $D$, if

$$ M \succ^P_{0|D} \tilde M \iff PP^0|MD > PP^0|\tilde M D $$

Analogously, conditional on the model $M$, the information order defines a linear order on data sets. $D$ is said to be predictive-wise more informative than $D'$ conditional on model $M$ if

$$ D \succ^P_{0|M} D' \iff PP^0|MD > PP^0|MD' $$

Properties of Measures of Predictive Ability

The leave-one-out (LOO) predictive precision measures allow the researcher to assess the performance of the model-data program in terms of its predictive ability regardless of the particular forecasting application. The LOO predictive precision measures favor MD programs that are (i) empirically more accurate and/or (ii) more specific.
This is because these measures summarize the fit of the data to the considered model and the extent of the uncertainty in identifying the systematic model component from observed behavior, through the effect of the error process and of the supporting set respectively.

Theorem 1.19 (Program Selection) Let $MD$ and $\tilde M D'$ be two alternative model-data programs. Assume that $M$ and $\tilde M$ are two models with convex supporting sets and assume that the predictive distributions are defined consistently with Definition 1.10. Then,

1. If $S^j\left(\hat m_{-j}|MD\right) = S^j\left(\hat{\tilde m}_{-j}|\tilde M D'\right)$ for all $j$ and $\|vec(\hat\varepsilon)\| < \|vec(\hat{\tilde\varepsilon})\|$, then there exists $\bar\alpha \in (0, 1)$ such that for all $\alpha < \bar\alpha$, $MD \succ^P_\alpha \tilde M D'$.

2. If $\|vec(\hat\varepsilon)\| = \|vec(\hat{\tilde\varepsilon})\|$ and $M \subset \tilde M$, then there exists $\bar\alpha \in (0, 1)$ such that for all $\alpha < \bar\alpha$, $MD \succ^P_\alpha \tilde M D'$.

3. For increasing dense sequences $D^n$ and $D'^n$ such that $S^0(\hat m^n) \to \{m^0\}$ and $S^0(\hat{\tilde m}^n) \to \{\tilde m^0\}$ for any $B^0$, with $\hat m^n$ and $\hat{\tilde m}^n$ consistent estimators of the systematic component for the sequences of programs $MD^n$ and $\tilde M D'^n$ respectively, if $\|vec(\varepsilon)\| < \|vec(\tilde\varepsilon)\|$ then $MD^n \succ_P \tilde M D'^n$,

where $M_A D_A \succ_P M_B D_B \iff PP^{LOO}_{M_A D_A} > PP^{LOO}_{M_B D_B}$.

Understanding the Contribution to Predictive Precision: Fit vs Power

The construction defined in 1.10 allows the contributions of fit and of the certainty of predicting the systematic component to predictive precision to be disentangled. First, I construct the predictive distribution that would be obtained if observed behavior were equal to the projection. This "zero-error" predictive distribution reflects the uncertainty of predicting behavior due only to "model uncertainty". The predictive precision measures constructed from this "zero-error" predictive distribution are better than those for the overall predictive distribution, due to the effect of the error process.
Thus $PP^0_{MD}|_{\varepsilon = 0} \ge PP^0_{MD}|_{\hat\varepsilon}$, and

$$ \text{Contribution of } \varepsilon \text{ to } PP^0_{MD} = PP^0_{MD}|_{\varepsilon = 0} - PP^0_{MD}|_{\hat\varepsilon} $$

1.4.4 Comparison to other measures in the literature

Comparison to goodness of fit measures

To address the sharpness of GARP, several goodness of fit measures have been proposed in the literature. Conditional on the sequence of observed budget sets, the proposed measures reflect other measures in the literature through the effect on the error process. I consider the comparison to three types of measures: (i) the Houtman-Maks index (HM), (ii) Afriat's index and (iii) the Money Pump Index (MPI). While HM is a counting measure that gives the size of the largest subset of choices that is consistent with GARP, (ii) and (iii) are measures of the extent of the violations. Afriat's measure is defined as the minimum perturbation to income required to remove all inconsistencies, therefore reflecting the extent of the worst violation. MPI measures the "amount of money that could be extracted from the consumer" who violates GARP, therefore providing a measure of the mean extent of these violations. Formally,

Definition 1.20 (Measures of fit) Set $M$ = GARP and data set $D$. The Houtman-Maks$^{29}$ index is defined as

$$ HM = \max \frac{|D_M|}{J} \quad \text{such that } D|_{J_M} \in M|_{J_M} $$

where $J_M$ is a subset of $\{1, \dots, J\}$. The Afriat$^{30}$ index is defined as

$$ e^* = \arg\max_e \left\{ e : D \in M(e) \right\} $$

where $M(e)$ imposes the acyclicality condition on the Revealed Preference-$e$ binary relation, where $q^j$ is said to be revealed preferred to $q^k$ at efficiency level $e$ if $e\, p^j \cdot q^j > p^j \cdot q^k$. The Money Pump Index$^{31}$ for each sequence $(q^{k_1}, \dots, q^{k_n}) \notin M$ is defined as

$$ MPI_{(q^{k_1}, p^{k_1}), \dots, (q^{k_n}, p^{k_n})} = \frac{\sum_{l=1}^{n} p^{k_l} \cdot \left(q^{k_l} - q^{k_{l+1}}\right)}{\sum_{l=1}^{n} p^{k_l} \cdot q^{k_l}} $$

29 (Houtman and Maks 1985)

30 (Afriat 1967)

Proposition 1.21 (Connection to Goodness of Fit Measures) Consider $M$ = GARP and assume that $D^j = D'^j$ for all $j \ne i$, $q, q' \gg 0$, and $\hat m = \hat m'$. Then, there exists $\bar\alpha \in (0, 1)$ such that
1. If $HM(q) > HM(q')$ then $MD \succ^P_\alpha MD'$
2.
If $Afriat(q) > Afriat(q')$ then $MD \succ^P_\alpha MD'$
3. If $MPI(q) < MPI(q')$ then $MD \succ^P_\alpha MD'$
for all $\alpha < \bar\alpha$.

31 (Echenique, Lee, and Shum 2011)

Proposition 1.21 shows that, conditional on the identification of the systematic component, the LOO predictive precision measure reflects the extent of the violations of the model. As discussed in Section 1.2.2, high fit can be the result of behavior generated by a rational decision maker or of the inability of the program to detect departures from the model. Therefore, the proposed information order does not replicate the order defined by measures of fit in the literature but partially reflects the information provided by them.

Comparison to approaches that account for power

For a given model, there are two sources of uncertainty: the uncertainty in predicting the systematic component and the error process. Measures of fit only provide information about the first of these sources of uncertainty. Good fit can hardly be interpreted as a success for the model since conditions may be so undemanding that "anything goes". The standard approach in the literature to assess the significance of measures of fit is to compare the results to those that would be obtained if data were generated randomly from a uniform distribution over feasible choices (Bronars 1987), consistent with the definition of irrational behavior proposed by (Becker 1962). Given fit, higher power is desirable, and given power, higher fit is desirable, but the rankings prescribed by these two properties tend to be negatively correlated and most approaches do not provide a way to combine fit and power. (Beatty and Crawford 2011) (BC henceforth) incorporates (Bronars 1987) by adjusting a measure of fit by the size of the target area, defined as the probability of choices being rational if they were generated by a random subject, that is, an ex-ante measure of power.
Formally,

Definition 1.22 (Measure of Success ((Beatty and Crawford 2011))) BC defines the measure of success as $m = r - a$, where $r \in \{0, 1\}$ is a pass/fail measure with respect to the model, $r = 1 \iff q \in M(B)$$^{32}$, and $a \in [0, 1]$ is a measure of the target area, that is, the size of the set of consistent choices relative to the whole outcome space.

32 BC also proposes a continuous version of success that reflects the distance between observed data and the set of sequences of choices that are consistent with the model; that is, $r \in [0, 1]$.

The BC measure provides a statistic that, by controlling for the stringency of the constraints imposed by the model, allows comparison across behaviors that may not have been observed in the same economic environments. By adopting a measure of power in the spirit of (Bronars 1987), the authors do not account for the actual pattern of choices observed. As argued in previous sections, this pattern is a significant determinant when assessing whether or not the model imposes demanding constraints on data.

By construction, the predictive ability approach accounts for power conditional on choices, through the effect on the "supporting set". For each observation $j \in \{1, \dots, J\}$, the predictive precision measures constructed using observations $\{1, \dots, J\} \setminus \{j\}$ reflect the size of the "supporting set", $S^j(\hat m_{-j})$ (Theorem 1.19). Then, under the (Becker 1962) hypothesis of irrational behavior, we have that

$$ \frac{\|S^j(\hat m_{-j})\|}{\|B^j\|} = 1 - P\left( (\hat m_{-j}, q^j) \notin M \right) $$

where $\hat m_{-j} \in M|D \setminus D^j$ and $P = U_{B^j}$. That is, the relative size of the "supporting set" is the complement of the probability of detecting violations of GARP in $B^j$ under the assumption that choices are irrational, conditional on a sequence of choices for budget sets $B^i$ with $i \ne j$ consistent with the model.
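The ex-ante notion of power in the spirit of (Bronars 1987) can be sketched by Monte Carlo for the two-good case. The code below is an illustration under stated assumptions rather than the computation used in this chapter: random choices are drawn by sampling the budget share of good 1 uniformly (which is uniform on the budget line with two goods), GARP is checked through the transitive closure of the direct revealed preference relation, and the prices, incomes and function names are hypothetical.

```python
import random

def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

def satisfies_garp(prices, choices, tol=1e-12):
    """True if the data contain no GARP violation."""
    n = len(prices)
    spend = [dot(prices[i], choices[i]) for i in range(n)]
    # Direct revealed preference: q^i R0 q^j if p^i.q^i >= p^i.q^j.
    R = [[spend[i] + tol >= dot(prices[i], choices[j]) for j in range(n)]
         for i in range(n)]
    # Transitive closure (Warshall's algorithm).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                R[i][j] = R[i][j] or (R[i][k] and R[k][j])
    # Violation: q^i R q^j while q^j is strictly directly revealed preferred to q^i.
    for i in range(n):
        for j in range(n):
            if R[i][j] and spend[j] > dot(prices[j], choices[i]) + tol:
                return False
    return True

def bronars_power(prices, incomes, draws=20000, seed=0):
    """Share of uniformly random choice sequences that violate GARP."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(draws):
        choices = []
        for p, x in zip(prices, incomes):
            s = rng.random()  # budget share of good 1, uniform on [0, 1]
            choices.append((s * x / p[0], (1 - s) * x / p[1]))
        if not satisfies_garp(prices, choices):
            violations += 1
    return violations / draws

# Hypothetical design: two crossing budget lines with income 5.
prices, incomes = [(1.0, 2.0), (2.0, 1.0)], [5.0, 5.0]
print(satisfies_garp(prices, [(3.0, 1.0), (1.0, 3.0)]))  # consistent choices
print(satisfies_garp(prices, [(1.0, 2.0), (2.0, 1.0)]))  # a GARP violation
print(bronars_power(prices, incomes))                    # ex-ante power of the design
```

For this design a violation requires each random choice to fall in the third of its budget line that lies strictly inside the other budget, so the simulated power should be close to 1/9. The conditional assessment discussed in the text differs in that the choices on all budgets but one are held fixed at the projection rather than drawn at random.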
Exploiting this relation between the ability to infer information from data to predict behavior and power, the proposed measures tie together the literature on fit and power, providing a meaningful trade-off in terms of their contribution to predicting behavior.

By construction, the assessment of power in this chapter is done conditional on observed choices. In doing so, these measures do not replicate the information provided by the BC measure. The next example illustrates the relevance of the difference between ex-ante and conditional power.

Example (Continued Example 2.3) Consider again the choices presented in Example 2.3. The measures of predictive precision are $PP^0_\alpha|MD_A = 0.05$ and $PP^0_\alpha|MD_B = 0.56$ for the measure defined in 1.15, and $PP^0_{KL}|MD_A = 0.01$ and $PP^0_{KL}|MD_B = 0.21$, for decision makers A and B respectively. According to these measures GARP provides more precise predictions for $D_B$ than for $D_A$. By construction $q^A \in M$ while $q^B \notin M$, so the success rate for subject A is bigger than for subject B, $r_A > r_B$, while the size of the target area is identical, by construction, for both subjects, $a_A = a_B$. Therefore (Beatty and Crawford 2011) would decide in favor of $D_A$.

Figure 1.12: Comparison of BC and predictive ability measures. Data in (Beatty and Crawford 2011).

The difference between ex-ante power and a conditional assessment of power is important when studying the empirical performance of a model. In Section 1.5 I show that, for the considered data set, power adjustments in the spirit of (Bronars 1987) become negligible, since ex-ante power rapidly converges to 1. That is, for small samples the BC measure accounts for differences in the design that could lead to differences in the power to identify inconsistencies, as shown in Figure 1.12, but these differences become indistinguishable as the number of observations increases.
This seemingly uniform power across subjects, however, masks significant differences in the amount of information that can be inferred from data to predict behavior. Figure 1.13 shows that, even for small samples, the BC measure masks significant differences in predictive ability among subjects with seemingly identical power.

Figure 1.13: Comparison BC and predictive ability measures. Even when the BC measure differentiates among designs in small samples, it still masks significant differences in the predictive ability of the model given observed behavior. These computations were carried out for the data studied in (Beatty and Crawford 2011).

1.4.5 Optimal Design of Experiments

The proposed measures can be used for optimal experimental design. Consider a researcher who needs to design a sample or experiment to maximize the predictive power of a model M and has information available about past behavior, $D^J \equiv \{D^j\}_{j=1}^{J}$.

Problem 1.23 Let $B^0$ be the objective budget set for which the researcher wants to predict behavior.33 The researcher has data on behavior for J economic environments, $D^J$. She has been offered a set of experiments to run, and she seeks to optimize the precision with which behavior is predicted.

33 If there is no application in mind, the researcher can follow the approach presented in Section 1.4.2 to maximize the amount of information to identify underlying behavior.

For a set of candidate experimental designs $B_1^{J+1}, \ldots, B_d^{J+1}$, assuming the data generating process is stable across all observations, $B^{J+1}$ can be chosen using the proposed measures of predictive precision. If the objective is to predict behavior, the optimal selection of $B^{J+1}$ is the one that improves predictive precision the most. To construct the predictive distribution including $B^{J+1}$, I first construct the predictive distribution for $B^{J+1}$ conditional on data from observations {1, . . . , J}.
Then, either (i) construct a point prediction from the predictive distribution by computing its mean, median or mode (the point of highest likelihood); or (ii) use the information from the predictive distribution to compute a weighted measure of predictive precision for $B^0$, where the weights are given by the likelihood under $F^{J+1}|MD^J$.

Definition 1.24 (Optimal Design - Point Prediction) Let $\hat{q}^{J+1} \equiv \arg\max_{\tilde{q}} f^{J+1}_{MD^J}(\tilde{q})$, where $f^{J+1}_{MD^J}(\cdot)$ is the predictive density for budget set $B^{J+1}$ constructed using $D^J$. The optimal selection is defined by

\[
B^{J+1}_{optimal} \equiv \arg\max_{\tilde{B}^{J+1}} PP^0|M\tilde{D}^{J+1}
\]

where $\tilde{D}^{J+1} \equiv D^J \cup \{\hat{q}^{J+1}, \tilde{p}^{J+1}, \tilde{x}^{J+1}\}$ and $\tilde{B}^{J+1} \equiv B(\tilde{p}^{J+1}, \tilde{x}^{J+1})$ for some prices $\tilde{p}^{J+1}$ and income $\tilde{x}^{J+1}$ such that $\tilde{B}^{J+1} \neq B^i$ for all i ∈ {1, . . . , J}.

Definition 1.25 (Optimal Design - Full Predictive Distribution) Let $f^{J+1}_{MD^J}(\cdot)$ be the predictive density for budget set $B^{J+1}$ constructed using $D^J$. Then the optimal selection is defined by

\[
B^{J+1}_{optimal} \equiv \arg\max_{\tilde{B}^{J+1}} \int_{\tilde{B}^{J+1}} f^{J+1}_{MD^J}(\tilde{q}^{J+1}) \, PP^0|M\tilde{D}^{J+1} \, d\tilde{q}^{J+1}
\]

where $\tilde{D}^{J+1} \equiv D^J \cup \{\tilde{q}^{J+1}, \tilde{p}^{J+1}, \tilde{x}^{J+1}\}$ and $\tilde{B}^{J+1} \equiv B(\tilde{p}^{J+1}, \tilde{x}^{J+1})$ for some prices $\tilde{p}^{J+1}$ and income $\tilde{x}^{J+1}$ such that $\tilde{B}^{J+1} \neq B^i$ for all i ∈ {1, . . . , J}.

The example below illustrates how this criterion works.

Example (Continued 2.3) Consider again decision maker A from Example 2.3. The researcher is considering expanding the sample with an additional observation from a third budget set $B^3$ and wants to select it so as to improve her forecast on $B^0$. She is considering two alternatives, $B^3_a$ and $B^3_b$, as shown in Table 1.5.

Obs   px     py     x    $\hat{q}^3$
3a    1.75   1.25   5    (1.427, 2.002)
3b    4      2      5    (1.255, 3.321)

Table 1.5: Experimental design. Data for consumer A, choosing from $B^1$ and $B^2$. The researcher is considering expanding the sample to include a third observation from $B^3$ to tighten the prediction for $B^0$. The two alternatives she is considering are $B^3_a$ and $B^3_b$.
$\hat{q}^3_a$ is computed as in Definition 1.24.

The computation of the expected predictive precision if budget set $B^3$ is incorporated follows Definition 1.24. The estimated $\hat{q}^3$ is displayed in the table above. For $\tilde{D}^3_a$ the predictive precision measures for forecasting on $B^0$ are $PP^0_{\alpha}|M\tilde{D}^3_a = 0.39$ and $PP^0_{KL}|M\tilde{D}^3_a = 0.12$, while for $\tilde{D}^3_b$ they are $PP^0_{\alpha}|M\tilde{D}^3_b = 0.05$ and $PP^0_{KL}|M\tilde{D}^3_b = 0.01$. For predicting behavior on $B^0$, $B^3_a$ is therefore expected to aid forecasting to a larger extent than $B^3_b$. The respective predictive distributions are depicted in Figure 1.14.

Figure 1.14: Optimal experimental design for prediction. (a) Expanding the sample to include $B^3_a$. (b) Expanding the sample to include $B^3_b$.

Note that if the researcher seeks to design the sample but has no information available, the best that can be done is to maximize the ex-ante probability of detecting violations for $B^0$, that is, power as it is traditionally understood in revealed preference theory.

1.5 Empirical Application

In this section I use data from (Choi, Fisman, Gale, and Kariv 2007) to: (i) show that more observations lead to more precise predictions; (ii) show that the proposed measures provide additional information to assess the usefulness of a model and do not replicate the ranking prescribed by other measures in the literature; and (iii) compare the predictive performance of GARP with that of the maximization of a Cobb-Douglas utility function.

This data set was obtained from a series of experiments designed to study decisions under uncertainty. The 93 subjects in the experiment were presented with a graphical interface displaying standard two-dimensional budget constraints on the screen. The experiment was conducted at the Experimental Social Science Laboratory at UC Berkeley, and each session consisted of 50 independent decision rounds. Summary statistics are shown in Appendix 1.C.
1.5.1 Effect of Additional Information

Additional information enhances the identification of the systematic component, which has two effects: (i) it improves precision through the shrinkage of the supporting set, and (ii) it improves the identification of the underlying error process, which increases the uncertainty in predicting behavior. Ex ante, the net effect is ambiguous. Intuitively, an additional observation from a budget set that intersects at least one of the previously observed ones may improve the identification of the systematic component by making previously consistent behavior no longer congruent with the model. But if that is the case, the distance between that observation and the closest consistent one increases, as given by the change in the magnitude of the observed error process.

Sample mean for 25 (half) and 50 (full) observations per subject
                        Full       Half       Difference   Test-statistic   P-value
R2                      0.9916     0.9968     -0.0052      1.6403           0.1051
                        (0.0023)   (0.0020)   (0.0032)
Adj R2                  0.1071     0.1118     -0.0046      0.4220           0.6742
                        (0.0072)   (0.0092)   (0.0109)
Mean ex-ante prob       0.1124     0.0967     0.0157       1.3663           0.1759
                        (0.0084)   (0.0095)   (0.0115)
BC with r discrete      0.2078     0.6232     -0.4154      -5.7306***       0.0000
                        (0.0465)   (0.0556)   (0.0725)
BC with r continuous    0.9923     0.9967     -0.0044      -1.4482          0.1492
                        (0.0022)   (0.0020)   (0.0030)
Size supporting         0.5219     0.6301     -0.1081      -9.9913***       0.0000
                        (0.0062)   (0.0116)   (0.0108)
PP CI (95%)             0.5186     0.4246     0.0941       14.9986***       0.0000
                        (0.0056)   (0.0072)   (0.0063)
PP KL                   0.2572     0.1664     0.0909       8.1483***        0.0000
                        (0.0096)   (0.0058)   (0.0112)

Table 1.6: Summary statistics for (Choi, Fisman, Gale, and Kariv 2007) - half and full sample. Data for 77 subjects, (Choi, Fisman, Gale, and Kariv 2007), test for differences in means. Standard errors in parentheses. BC with r discrete defines r = 1 if and only if observed data is consistent with GARP and 0 otherwise, while BC with r continuous defines r = R2 from the projection.
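A stylized illustration of why the net effect described in this subsection is ambiguous, under simplifying assumptions that are not part of the formal results: suppose the supporting set is an interval of length $\ell$, the location $S$ is uniform on it, the error $\varepsilon$ is independent of $S$ with variance $\sigma_\varepsilon^2$, and the feasibility truncation is ignored. The symbols $\ell$ and $Y$ are introduced here only for illustration. The untruncated prediction $Y = S + \varepsilon$ then has variance

```latex
\operatorname{Var}(Y)
  = \operatorname{Var}(S) + \operatorname{Var}(\varepsilon)
  = \frac{\ell^{2}}{12} + \sigma_\varepsilon^{2}.
```

An additional observation weakly shrinks $\ell$, which lowers the first term, but it may raise the estimate of $\sigma_\varepsilon^2$, which raises the second. Which effect dominates depends on the data, which is why the question is settled empirically below.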
To study the effect of additional information on the ability to predict behavior, I compare the predictive ability measures constructed from the first 25 observations with the same measures constructed from the entire sample.34 The results are presented in Table 1.6. The table displays the mean across subjects of the measures of fit, adjusted fit, the BC measures and the predictive precision (PP) measures, for the full sample (first column) and the first half of the sample (second column). The third column displays the difference between columns 1 and 2, while columns 4 and 5 show the t-statistic and p-value for the null hypothesis that the two means are equal. Raw fit and the mean ex-ante probability of detecting violations are not significantly different when comparing 25 versus 50 observations. This is evidence that preferences are stable and that the process generating the budget sets is stable as well.

34 As a robustness check, the same exercise is carried out by splitting the sample into odd and even observations, using the former to construct the measures and comparing them to the whole sample. The results are not significantly different; see Table 1.12 in Appendix 1.C.

The BC measure provides contradictory information depending on whether the discrete or the continuous measure of success is used. This is explained by the fact that, for the considered data, the relative size of the target set is approximately 0 for all subjects. Thus the discrete version adjusts a binary indicator of whether the data satisfy GARP by a quantity that is approximately 0 for all subjects, and therefore provides the same information as the deterministic test. Since the share of rational individuals drops from 63% to almost 20% when moving from 25 to 50 observations, the discrete measure favors smaller samples.
On the other hand, the continuous version of the BC measure collapses to the measure of success considered, so it cannot detect significant differences between 25 and 50 observations, since there are no significant differences in fit.

In contrast, the predictive precision measures improve with additional observations. The precision in predicting behavior improves significantly because of the enhanced identification of the systematic component, which shows up as the shrinkage of the "supporting set". Although the difference in the mean ex-ante probability of detecting violations is not significant, the overall probability increases with the sample size, which explains the significant difference in the mean size of the supporting set. Nevertheless, the effect of conditional power is not negligible even for the full sample, and it provides information about the stringency of the constraints across subjects, in contrast to the BC measure. This is studied further in the next section.

                         Subject 516   Subject 313
R2                       0.9998        0.9999
Varian                   0.9750        0.9700
BC                       0.9998        0.9999
R2 − (1 − P(viol))       0.2836        0.2931
PP CI (95%)              0.6096        0.4867
PP KL                    0.3134        0.2636
Size supporting          0.4335        0.5713

Table 1.7: Statistics comparison. Data for subjects 516 and 313, (Choi, Fisman, Gale, and Kariv 2007).

1.5.2 Empirical Performance of Predictive Ability Measures and Comparison to Other Measures in the Literature

The difference between the predictive measures and the BC measure of success can be shown by comparing two subjects. Consider subjects 516 and 313; the choices for these subjects are shown in Figure 1.15 in the Appendix. Figure 1.16 in the Appendix shows the predictive distribution for good 1, constructed from the data for these two subjects for four randomly generated environments.
For some of these new economic environments, the data indeed provide distinct information to predict behavior (Panels 1.16a and 1.16b), while for others the difference is subtle or almost nonexistent (Panels 1.16c and 1.16d). Table 1.7 shows that, even though standard goodness of fit measures and adjusted ones35 would indicate that the model performs better for subject 313, subject 516 generates a more informative predictive distribution. This is because, even if ex ante the design seems more demanding for subject 313, a closer look reveals that, given observed choices, more information can be inferred from the choices of subject 516, as can be seen from the comparison of the mean size of the supporting set.

35 In the spirit of the measure proposed by (Beatty and Crawford 2011), but using the mean ex-ante probability of detecting violations instead of the overall probability, in order to capture differences in experimental design across individuals.

1.5.3 Model Comparison: GARP vs. Parametric Specification

I use the predictive approach to compare the nonparametric rationality model with a nested parametric model in which choices are generated by a Cobb-Douglas utility function plus error. The parameters that characterize the preferences of each decision maker are allowed to be individual specific.

Because a parametric DGP delivers a point prediction, one may expect the predictive distribution for the Cobb-Douglas model to be more informative than that of a more general model (GARP) that nests it. But by being more restrictive, the parametric model induces misspecification error, which increases its entropy. In the considered sample the latter effect dominates, as can be seen from the results in Table 1.8, and the difference is statistically significant.
The only case in which the parametric model outperforms GARP is subject 603, who, given her irregularities36 across all measures of behavior, was not taken into account in the empirical analysis performed in previous sections.

36 Varian Index < 0.3

1.6 Relation to the Literature

This chapter relates to two large bodies of literature: (i) revealed preference testing, measures of fit and power adjustments for GARP, and (ii) model comparison and model selection. It contributes to both by proposing the use of the predictive distribution to measure the success of a nonparametric model of utility maximization while allowing for error. To my knowledge, the construction of such a distribution is novel in the literature. (Adams 2013) allows for error, but identifies only a point forecast by applying the "minimum discrimination information principle". The approach developed in this chapter identifies instead a predictive distribution, which is more informative and can also be used to measure the quality of the model.

Microeconomic theory has approached the problem of assessing the quality of the model when observed choices are not perfectly rational by proposing goodness of fit measures based on some intuitively appealing moment of the data associated with the monetary cost of the departures from the model. Section 1.4.4 discusses the relation between the predictive performance approach developed in this chapter and the most salient approaches to goodness of fit in the literature.

A common problem that plagues the goodness of fit literature is the effect of the relative distribution of budget sets on the ability to detect departures from rationality. The literature has approached this problem by studying the "power" of the rationality test, understood as the probability of finding violations given observed economic environments for some alternative behavioral process.
(Beatty and Crawford 2011) propose a measure of predictive success that combines fit and power following the axiomatization provided by (Selten 1991); the relation between this measure and the approach presented in this chapter is shown in Section 1.4.4. In a similar vein, (Hoderlein and Stoye 2009) and (Andreoni and Harbaugh 2013) tackle this problem by proposing alternative measures of power. The predictive precision measures proposed in this chapter combine fit and power by sizing the informational content of the predictive distribution, providing a meaningful trade-off between the two.

This chapter uses a projection technique to identify the systematic component and the error process from data. Others have proposed similar techniques, such as (Varian 1985) and (Fleissig and Whitney 2005), but I further impose feasibility constraints. Under the behavioral assumption that choices are rational but observed with error, it naturally follows that the rational component must lie in the set of feasible choices, and by definition observed choices are feasible as well. (Halevy, Persitz, and Zrill 2014) propose a projection technique that considers the feasibility constraints but is based on a parametric assumption over the utility function; that is, the data is projected onto the subset of feasible choices that are consistent with the assumed parametric form, inducing misspecification error. The projection technique employed in this chapter relies only on the nonparametric constraints imposed by GARP.

                          GARP       CD         Difference   CI for Difference
PP CI                     0.5140     0.2436     0.2702       0.2406      0.2999
                          (0.0059)   (0.0157)   (0.0149)
  t-statistic                                   18.14
PP KL                     0.3422     0.2552     0.0870       0.0609      0.1131
                          (0.0104)   (0.0094)   (0.0131)
  t-statistic                                   6.64
$\|vec(\hat{\varepsilon})\|$   558.49     20504.6    -19946.1     -24466.2    -15246.1
                          (187.67)   (2301.9)   (2271.3)
  t-statistic                                   -8.78

Table 1.8: Model Comparison. Comparison GARP vs a Cobb-Douglas utility function - Means.
Finally, this chapter relates to papers in the model selection literature. The information criteria literature deals with the trade-off between the goodness of fit of a model and its complexity, providing tools for model selection; see (Akaike 1998) and (Geisser and Eddy 1979). (Geisser and Eddy 1979) deal with model comparison in the context of parametric models; Section 1.4.2 shows how the predictive distribution proposed in this chapter can be used for such constructions.

1.7 Conclusion

Rationality is one of the most prevalent assumptions in economics, but empirically it is almost surely violated. GARP provides an elegant nonparametric test for rationality, but a single violation causes the data to be declared inconsistent. The aim of this chapter is to assess the quality of these models when the data may not pass the deterministic test, relying solely on the necessary and sufficient conditions provided by the axiomatization. I develop a framework that, combining fit and the uncertainty in predicting consistent behavior, gauges the quality of the model by its predictive ability given observed data. By extending the nonparametric prediction for GARP provided by (Varian 1982) to allow for error, I provide a novel construction of the predictive distribution of axiomatic models. This distribution is a sufficient statistic for a decision maker who seeks to select a model to predict behavior. For a researcher who does not have an application in mind, the predictive distribution can be used to compute measures of the performance of the model based on marginal likelihood and two intuitive statistics that summarize the predictive performance of the model.

This approach provides a joint measure of the quality of the model-data pair, therefore allowing for comparisons across data sets and/or behavioral models.
The trade-off between these two features of the model is made conditional on observed data, therefore outperforming other measures of fit prevalent in the literature. Other measures in the literature that combine fit and power cannot differentiate among many model-data sets, since power adjustments rapidly become insignificant. In contrast, the predictive precision measures differentiate between cases that seemingly have the same power but deliver significantly different predictive distributions.

The proposed predictive precision measures exhibit, theoretically and empirically, the desired properties. Noisier data perform worse, and more demanding environments deliver more informative predictive distributions. Moreover, increasing the number of observations allows the researcher to extract more information from the data, which results in more precise predictions.

By combining fit and power, the predictive precision measures are more informative than standard measures in the literature. By relying solely on the axiomatization provided by GARP, without any further assumption on behavior and allowing for individual specific behavior, I avoid the misspecification biases common in the recoverability/demand estimation literature. By imposing the constraints implied by the model when projecting the data, I overcome the concerns raised by (Lewbel 2001) with respect to the common econometric approaches. Finally, the framework proposed in this chapter allows for model comparison featuring a natural trade-off between the generality of the model and its empirical accuracy.

1.A Proofs

1.A.1 Proofs of Section 1.3

Proof of Theorem 1.11.
(1) From Definition 1.10, the predictive distribution conditional on data D is given by

\[
f^0_{MD}(y) = \int_{s \in S^0(\hat{m}_{MD})} \frac{1}{\int_{s \in S^0(\hat{m}_{MD})} ds} \, \frac{f_{\varepsilon}(x - s)\, \mathbb{1}\left[(x - s) \cdot p^0 = 0\right]}{\int_{x:(x-s)\cdot p^0 = 0} f_{\varepsilon}(x - s)\, dx} \, \mathbb{1}\left[y = \arg\min_{\tilde{y} \gg 0} \|\tilde{y} - x\|\right] ds \tag{1.5}
\]

Given Assumption 1, the distribution of $Y^0_{MD}$ on $B(\bar{s}, \delta) \subset B^0$, with

\[
\bar{s} = \int_{S^0(\hat{m}_{MD-A})} \frac{s}{\int_{S^0(\hat{m}_{MD-A})} ds}\, ds,
\]

is symmetric around $\bar{s}$, and its mean is precisely $\bar{s}$. Then, given that $Y^0_{MD-A}$ and $Y^0_{MD-B}$ differ only in the variance of the error process, for all δ > 0 such that $B(\bar{s}, \delta) \subset B^0$,

\[
E\left[Y^0_{MD-A} \mid Y^0_{MD-A} \in B(\bar{s}, \delta)\right] = E\left[Y^0_{MD-B} \mid Y^0_{MD-B} \in B(\bar{s}, \delta)\right] = \bar{s} \tag{1.6}
\]

Furthermore, distribution (1.5) truncated to $B(\bar{s}, \delta)$ is given by

\[
f^0_{MD}(y \mid y \in B(\bar{s}, \delta)) = \frac{1}{P_{f^0_{MD}}(y \in B(\bar{s}, \delta))} \, \frac{1}{S} \int_{s \in S^0(\hat{m}_{MD})} \frac{1}{A(s)} f_{\varepsilon}(y - s)\, ds \tag{1.7}
\]

where $A(s) \equiv \int_{x:(x-s)\cdot p^0 = 0} f_{\varepsilon}(x - s)\, dx$ and $S \equiv \int_{s \in S^0(\hat{m}_{MD})} ds$. Note that, before the truncation, the predictive distribution is a mixture of distributions with the same variance but different means, where the weighting scheme is dictated by the uniform distribution on the supporting set. Therefore, the variance of the mixture is given by the weighted average of the variances of the mixing components, that is, the variance of the error process, plus the variance of the distribution of the location parameter, which in this case is the variance of $S \sim U_{S^0}$.

By construction, the density of $Y^0_{MD}$ is symmetric around $\bar{s}$ before the truncation, and it is also symmetric on any $B(\bar{s}, \delta)$ such that $B(\bar{s}, \delta) \subset B^0$. Moreover, by convexity of the supporting set, the variance of the symmetric truncation is a function of the overall variance; that is, if $\|vec(\hat{\varepsilon}_A)\| > \|vec(\hat{\varepsilon}_B)\|$, then $Diag(\Sigma_A) \geq Diag(\Sigma_B)$, and the result follows.

(2) From the previous point, the effect of a bigger supporting set on the untruncated predictive distribution operates through the variance of the distribution of the location parameter on the supporting set.
By the properties of the supporting set it follows that, for $S_A \sim U_{S^0(\hat{m}_{MD-A})}$ and $S_B \sim U_{S^0(\hat{m}_{MD-B})}$ with $S^0(\hat{m}_{MD-A}) \supset S^0(\hat{m}_{MD-B})$, $Var(S_A) > Var(S_B)$. Therefore the variance of the predictive distribution before the truncation depends on the variance of the location parameter, and hence on the size of the supporting set.

(3) and (4) follow from Definition 1.10, where the convergence of the distribution to the distribution of the error process follows from standard consistency properties of least squares estimation when the truncation is not binding (q >> 0).

1.A.2 Proof of Section 1.4.3

Proof of Theorem 1.19. From Theorem 1.11 we know that the variance of the untruncated distribution increases when the variance of the error process or the size of the supporting set increases. Then, by Definition 1.15, $PP_{\alpha}$ is defined through the smallest connected set $C^{\alpha}$ such that $P_{F^0_{MD}}(y \in C^{\alpha}) \geq 1 - \alpha$. Moreover, given convexity of the supporting set, this set contains $\bar{s} = \frac{\int_{s \in S^0} s\, ds}{\int_{s \in S^0} ds}$ for all $\alpha < \bar{\alpha}$, with $\bar{\alpha} = 1 - \max_{y \in B^0 \setminus \mathbb{R}^L_{++}} f^0_{MD}(y)$. Then (1) and (2) follow.

Finally, for two point-identified models, the predictive distribution converges to the distribution of the error process (Theorem 1.11), truncated due to feasibility constraints. As in (1), conditional on the supporting set, in this case given $\hat{m}_n = \tilde{m}_m$, if $\|vec(\hat{\varepsilon})\| < \|vec(\tilde{\varepsilon})\|$ then the variance of the untruncated distribution for $\tilde{M}D$ increases. Defining $\bar{\alpha}$ as above, the result follows.

1.A.3 Proofs of Section 1.4.4

Proof of Proposition 1.21.

1. Let $Q_A \subset \{1, \ldots, J\}$ be the maximal subset of choices that is consistent with the model, that is, $q_{Q_A} \in M\left(B\left(\{p^i, x^i\}_{i \in Q_A}\right)\right)$; then $HM(q_A) = \frac{\|Q_A\|}{J}$. $HM(q_A) > HM(q_B) \Rightarrow \|Q_A\| > \|Q_B\|$, so it must be the case that $\hat{\varepsilon}^i_A \cdot \hat{\varepsilon}^i_A < \hat{\varepsilon}^i_B \cdot \hat{\varepsilon}^i_B$, and the result follows from Theorem 1.19.

2. $Afriat(q_A) > Afriat(q_B) \Rightarrow \exists\, j \in \{1, \ldots, J\}$.
Let $e_A = Afriat(q_A)$ and assume, without loss of generality, that the perturbed revealed preference relation that imposes the cheapest taxation37 on the value of the efficiency index e is such that $e_A = \arg\max_e \{e\, p^i \cdot q^i_A \geq p^i \cdot q^j_A\} > \arg\max_e \{e\, p^i \cdot q^i_B \geq p^i \cdot q^j_B\}$, that is, if and only if $e_A \equiv \frac{p^i \cdot q^j_A}{p^i \cdot q^i_A} > e_B \equiv \frac{p^i \cdot q^j_B}{p^i \cdot q^i_B}$, which, given local nonsatiation, yields $p^i \cdot q^j_A > p^i \cdot q^j_B \rightarrow p^i \cdot \left(q^j_A - q^j_B\right) > 0$, which in turn implies that $\hat{\varepsilon}^i_A \cdot \hat{\varepsilon}^i_A < \hat{\varepsilon}^i_B \cdot \hat{\varepsilon}^i_B$, and the result follows from Theorem 1.19.

37 The argument follows if $q^j R(e) q^i$.

3. $-MPI(q_A) > -MPI(q_B)$ if and only if there exists a sequence of choices that involves observation $q^i$ and constitutes a cycle, namely $J_i \subseteq \{1, \ldots, J\}$ with $i \in J_i$. That is, $\{q^j\}_{j \in J_i} \notin M\left(B\left(\{p^j, x^j\}\right)\right)$. Moreover, the definition of the MPI implies that, for at least one observation $k \neq i$ in $J_i$, $p^k \cdot (q^k_A - q^i_A) + p^i \cdot (q^i_A - q^k_A) < p^k \cdot (q^k_B - q^i_B) + p^i \cdot (q^i_B - q^k_B)$, which in turn implies that

\[
q^i_A \cdot \left(p^i - p^k\right) < q^i_B \cdot \left(p^i - p^k\right) \Leftrightarrow \left(q^i_B - q^i_A\right) \cdot \left(p^i - p^k\right) > 0
\]

which in turn implies that $\hat{\varepsilon}^i_A \cdot \hat{\varepsilon}^i_A < \hat{\varepsilon}^i_B \cdot \hat{\varepsilon}^i_B$, and the result follows from Theorem 1.19.

1.B Error Process

1.B.1 Algorithms - Estimation of the Error Process in Finite Samples

Algorithm 1.26 (Simulation of $F_{\varepsilon_\theta}$ by simulating M sequentially)
Input: Data D, projection $\hat{m}$ and θ
Output: vector $e^2 \in \mathbb{R}^{M \times L}$, where M is the number of repetitions for the simulation.
1. Choose M and set m = 1
2. Find a randomized order $I^m$ and set i = 1
3. Take $I^{m(i)}$ and draw $z^m_{I^{m(i)}} \sim U\left[Supp\left(p^{I^{m(i)}}, x^{I^{m(i)}}\right) \mid \hat{m}_{-I^{m(i)}}\right]$
4. Draw $e^m_j$ from $F_{\varepsilon_\theta}$ for j = 1, . . . , J, and compute $y^m = z^m_j + e^m$
5. If $y^m \in B(p, x)$, set $q^m = y^m$ and go to 6. Otherwise compute $q^m$ from $y^m$ consistently with equation 1.2
6. If $q^m \in M$ then set $e^2_m = 0$, m = m + 1 and i = i + 1. Otherwise compute $e^2_m = \min\{q^m - \hat{m}, q^m - z^m\}$
7. Repeat steps 3-6 M times

Algorithm 1.27 (Computation of $\sigma^2_{\varepsilon}$ assuming $\hat{q} = m$)
Input: choice data q, observed economic environments $\{p^i, x^i\}_{i=1}^{J}$, projection $\hat{q}$ and $\sigma^2$
Output: vector $e^1 \in \mathbb{R}^{M \times L}$, where M is the number of repetitions for the simulation.
1. Choose M and set m = 1
2. Draw $e^m_j$ from $N(0, \sigma^2)$ for j = 1, . . . , J, and compute $y^m = \hat{q} + e^m$
3. If $y^m \in B(p, x)$, set $q^m = y^m$ and go to 4. Otherwise compute $q^m$ from $y^m$ consistently with equation 1.2
4. If $q^m \in M(B(p, x))$ then set $e^1_m = 0$ and m = m + 1. Otherwise compute $e^1_m = q^m - \hat{q}$
5. Repeat steps 2-4 M times

Algorithm 1.28 (Computation of $\sigma^2_{\varepsilon}$ by simulating M(B(p, x)))
Input: data on choices q, observed economic environments $\{p^i, x^i\}_{i=1}^{J}$, projection $\hat{q}$ and $\sigma^2$
Output: vector $e^3 \in \mathbb{R}^{M \times L}$, where M is the number of repetitions for the simulation.
1. Simulate the rational choice set
   (a) Choose N, set n = 1 and i = 1, and set $J_N = \emptyset$
   (b) Choose j at random from the set $J_J = \{1, \ldots, J\} \setminus J_N$ and set $I^{n(i)} = j$
   (c) Draw $z^n_{I^{n(i)}} \sim U\left[Supp\left(B(p, x)_{I^{n(i)}}\right) \mid \hat{q}_{J_J}\right]$
   (d) Set $J_N = J_N \cup I^{n(i)}$ and repeat steps 1b-1c until $J_J = \emptyset$
   (e) Set $M^n_{sim}(B(p, x)) = z^n$
2. Simulate the distribution of the Q statistic
   (a) Choose M and set m = 1
   (b) Choose $z^m$ at random from the set $M_{sim}(B(p, x))$
   (c) Draw $e^m_j$ from $N(0, \sigma^2)$ for j = 1, . . . , J, and compute $y^m = z^m_j + e^m$
   (d) If $y^m \in B(p, x)$, set $q^m = y^m$ and go to 2e. Otherwise compute $q^m$ from $y^m$ consistently with equation 1.2
   (e) If $q^m \in M(B(p, x))$ then set $e^3_m = 0$ and m = m + 1.
       Otherwise compute $e^3_m = q^m - \hat{q}$
   (f) Repeat steps 2b-2e M times

1.B.2 Performance of Algorithms 1.26, 1.27 and 1.28 for Simulated Data

        Algorithms - $\sigma^2_{\varepsilon}$ - DGP 1      Algorithms - $\sigma^2_{\varepsilon}$ - DGP 2
σ       1        2        3            1        2        3
0.5     0.8850   0.9040   0.5370       0.9470   0.9570   0.5905
1       0.7405   0.7618   0.1794       0.8287   0.8816   0.2901
2       0.3128   0.4052   0.3592       0.4016   0.5528   0.2444
3       0.5876   0.6251   0.6460       0.4786   0.5149   0.5313
5       0.8631   0.9218   0.8087       0.8192   0.8918   0.7196
10      0.9765   0.9808   0.8497       0.9775   0.9794   0.8024

Table 1.9: Simulations algorithms. Kolmogorov-Smirnov two-tailed test statistics with respect to the measure constructed based on the projection for σ = 3, for two alternative distributions over a CD utility function.

1.C Tables

1.C.1 Data

Table 1.10: Summary statistics. Summary statistics for the proposed measures and other standard measures from the literature applied to the experimental data from (Choi, Fisman, Gale, and Kariv 2007).

Measure                   Mean      Median    Std. dev.   Min      Max
Considering all the observed budget sets, N = 81
Rational?                 0.1975    0.0000    0.4006      0.0000   1
Afriat                    0.9568    0.9810    0.0642      0.6860   1.0000
Varian                    0.8898    0.9440    0.1568      0.2290   1.0000
HM index                  47.1235   48        3.2418      29       50
# violations WARP         4.8889    4.0000    5.3852      0        27
# violations GARP         43.2099   6.0000    104.2718    0        559
Mean ex-ante prob viol    0.1108    0.0984    0.0731      0.0028   0.3775
Raw R2                    0.9906    0.9994    0.0214      0.8597   1.0000
Adjusted R2               0.1072    0.1023    0.0640      0.0009   0.2806
Weighted R2               0.9906    0.9994    0.0207      0.8720   1.0000
Size supporting           0.5231    0.5267    0.0538      0.3693   0.6415
PP CI (95%)               0.5141    0.5106    0.0533      0.3835   0.6358
PP KL                     0.2458    0.2236    0.0831      0.1432   0.6064
PP Hellinger              0.3229    0.3179    0.0818      0.1751   0.6006
PP TV                     0.4102    0.4001    0.0590      0.2977   0.5574

Raw R2, adjusted R2 and weighted R2 are defined as in 1.29, 1.30 and 1.31 respectively. PP Hellinger is defined consistently with 1.16 using the Hellinger distance between the two distributions, $PP^0_{info-Hell}|D = \frac{1}{\sqrt{2}} \left\| \sqrt{F^0} - \sqrt{P} \right\|_2$.
PP TV uses the Total Variation metric, $PP^0_{info-TV}|D = \frac{1}{2} \|F^0 - P\|_1$.

Table 1.11: Summary statistics - half of the sample. Summary statistics for the proposed measures and other standard measures from the literature applied to the experimental data from (Choi, Fisman, Gale, and Kariv 2007).

Measure                   Mean     Median   Std. dev.   Min      Max
Considering half of the observations per individual, N = 76
Rational?                 0.6316   1.0000   0.4856      0.0000   1
Mean ex-ante prob viol    0.0979   0.0791   0.0833      0.0028   0.4704
Raw R2                    0.9968   1.0000   0.0177      0.8516   1.0000
Size supporting           0.6315   0.6317   0.1015      0.0000   0.8091
PP CI (95%)               0.4245   0.4285   0.0639      0.2752   0.6515
PP KL                     0.1686   0.1640   0.0478      0.0856   0.4139
PP Hellinger              0.2349   0.2237   0.0930      0.0894   0.6910
PP TV                     0.3157   0.3179   0.0729      0.1653   0.6196

Raw R2, adjusted R2 and weighted R2 are defined as in 1.29, 1.30 and 1.31 respectively. PP Hellinger is defined consistently with 1.16 using the Hellinger distance between the two distributions, $PP^0_{info-Hell}|D = \frac{1}{\sqrt{2}} \left\| \sqrt{F^0} - \sqrt{P} \right\|_2$. PP TV uses the Total Variation metric, $PP^0_{info-TV}|D = \frac{1}{2} \|F^0 - P\|_1$.

               Mean - Odd Half   Mean - First Half   Difference   |Test-statistic|   P-value
R2             0.9978            0.9967              -0.0011      0.4927             0.6236
               (0.0060)          (0.0177)            (0.0022)
Adj R2         0.1139            0.0965              -0.0174      1.2401             0.2188
               (0.1069)          (0.0663)            (0.0140)
PA CI (95%)    0.4202            0.4255              0.0053       0.8371             0.4052
               (0.0542)          (0.0639)            (0.0064)

Table 1.12: Comparison - half of observations. Robustness check for the comparison between full sample and first half of the observations.
The table shows that results are not significantly different when considering the first 25 observations or the odd observations.

Alternative Goodness of Fit Measures

(Blundell, Browning, and Crawford 2008) proposed that, when a stochastic component is allowed in reconciling the data with the revealed preference conditions for utility maximization, the distance between the data and the restricted estimators of demand provides a natural formulation of a test statistic for the null of rationality. Alternatively, one can construct goodness of fit measures as measures of the distance of the data from the model. I propose to construct: (1) a raw goodness of fit, (2) an adjusted goodness of fit and (3) a weighted goodness of fit.

The raw goodness of fit is defined as the coefficient of determination constructed from the residuals of the projection. Formally,

Definition 1.29 (Raw R2) Let $\hat{\varepsilon}$ be defined as the residuals from the projection exercise. Then

\[
R^2_l = 1 - \frac{\left\| vec(\hat{\varepsilon}) - \overline{vec(\hat{\varepsilon})} \right\|}{\left\| vec(q) - \overline{vec(q)} \right\|} \tag{1.8}
\]

Like other measures of fit proposed in the literature, the empirical fit of the rationality model is tightly related to the ex-ante probability of detecting violations of the model given the observed economic environments, and not only to the noise in the data; the former is particularly important in small samples. I therefore propose to adjust the measure of fit by the ex-ante probability of detecting violations of the model given observed economic environments. This measure is closely related to (Beatty and Crawford 2011), though the measure considered here is the product of the two factors instead of their difference, and it uses the mean ex-ante probability instead of the overall probability. If I were to use the overall probability, for the considered sample the adjustment would be approximately 1 for all subjects.
Formally,

Definition 1.30 (Adjusted measure of goodness of fit) Define the adjusted measure of goodness of fit as
\[ \bar{R}^2 \equiv \text{Adjusted Goodness of fit} \equiv \text{Goodness of fit} \times Pr[\text{Violation model}] \qquad (1.9) \]
where Goodness of fit is defined as in 1.29 and Pr[Violation model] is defined as the mean of the ex-ante probability of detecting a violation of the model given observed economic environments.

Finally, the weighted measure of fit is a measure of fit where each observation is weighted by how unlikely (ex-ante) it is to observe such behavior given observed economic environments; that is,

Definition 1.31 (Adjusted measure of goodness of fit - Weighted R2) If there exists a j such that $p_j \neq 0$, then the weighted measure of goodness of fit is given by
\[ R^2_{l,weighted} = 1 - \frac{\|\omega\,(vec(\hat{\varepsilon}) - \overline{vec(\hat{\varepsilon})})\|}{\|\omega\,(vec(q) - \overline{vec(q)})\|} \qquad (1.10) \]
otherwise it is equal to zero. The weights are given by

Definition 1.32 (Weights) Consider the weight function $\omega$ such that
\[ \tilde{\omega}_j = \begin{cases} p_j & \text{if } \hat{\varepsilon}_j = 0 \\ 1 - p_j & \text{if } \hat{\varepsilon}_j \neq 0 \end{cases} \qquad (1.11) \]
and $\omega_j = \tilde{\omega}_j / \sum_{j=1}^{J} \tilde{\omega}_j$, with $p_j \equiv Pr[violation_{j|-j} \mid q_1, \ldots, q_{j-1}, q_{j+1}, \ldots, q_J]$.

Table 1.13: Correlations. Data for 81 subjects, (Choi, Fisman, Gale, and Kariv 2007).
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11)
(1) PP CI (95%): 1
(2) PP Hellinger: 0.8789* 1
(3) PP TV: 0.9469* 0.9694* 1
(4) Raw R2: 0.2227* 0.1212 0.0806 1
(5) Adjusted R2: 0.0868 0.1235 0.0831 0.1777 1
(6) Ex ante prob viol: 0.1263 0.1928 0.1536 0.0438 0.1674 1
(7) # viol WARP: -0.3979* -0.3623* -0.3021* -0.7100* -0.1073 -0.1741 1
(8) # viol GARP: -0.2469* -0.1796 0.1236 -0.7321* -0.0742 0.0038 0.7797* 1
(9) Afriat: 0.2455* 0.1690 0.1125 0.8635* 0.1202 0.0496 -0.8126* -0.8642* 1
(10) Varian: 0.2440* 0.1696 0.1260 0.8360* 0.1153 0.0428 -0.7984* -0.8552* 0.9323* 1
(11) HM: 0.3259* 0.2771* 0.2081 0.4642* 0.1576 0.1963* -0.7825* -0.7215* 0.6205* 0.5531* 1
(12) Size supporting: -0.9428* -0.4293* -0.8344* -0.9579* 0.2354* -0.0148 -0.0254 -0.0538 -0.2092 0.2352* 0.1733 0.1283

1.C.2 Comparison Between Subjects 313 and 516

Figure 1.15: Comparison between subjects. Data from Table 1.7. Panels: (a) Subject 516, (b) Subject 313.

Figure 1.16: Predictive densities - comparison. Predictive densities for Subjects 313 and 516 where B^0 were randomly generated. Panels: (a) p^0 = (6.41, 2.05) and x^0 = 102.51, (b) p^0 = (2.54, 5.33) and x^0 = 97.13, (c) p^0 = (2.65, 0.66) and x^0 = 95.67, (d) p^0 = (8.61, 39.38) and x^0 = 103.46.

Chapter 2

Satisficing and Stochastic Choice - with Victor Aguiar and Mark Dean

2.1 Introduction

People often do not pay attention to all the available alternatives before making a choice. This fact has led to an extensive recent literature aimed at understanding the observable implications of models in which the decision maker (DM) has limited attention.[1] In an important recent paper, (Manzini and Mariotti 2014) characterize the stochastic choice data generated by a DM who has standard preferences, but only notices each alternative in their choice set with some probability. The chosen item is therefore the best alternative in the ‘consideration set’ of noticed items, which may be a strict subset of the items which are actually available.
[1] Notable examples include (Masatlioglu, Nakajima, and Ozbay 2012), (Caplin, Dean, and Martin 2011), (Eliaz and Spiegler 2011), and (Salant and Rubinstein 2008).

The idea that a DM may not search exhaustively through all available alternatives is not new. (Simon 1955) introduced the concept of satisficing: an intuitively plausible choice procedure by which the DM searches through alternatives until they find one that is ‘good enough’, at which point they stop and choose that alternative.[2] This model has been hugely influential, both within economics and in other fields such as psychology (Schwartz, Ward, Monterosso, Lyubomirsky, White, and Lehman 2002) and ecology (Ward 1992).

Despite the popularity of the satisficing model, testing its predictions can prove challenging. It has long been known that standard choice data, which records only the choices made from different choice sets, cannot be used to disentangle satisficing from utility maximization (see (Caplin, Dean, and Martin 2011) for a discussion). Researchers have therefore typically resorted to richer data sets in order to test the satisficing model. For example, (Caplin, Dean, and Martin 2011) make use of ‘choice process’ data, which records the evolution of choice with decision time, while (Santos, Hortacsu, and Wildenbeest 2012) use the order in which alternatives were searched as recorded from their internet browsing history.

In this chapter, we characterize the observable implications of the satisficing choice procedure for stochastic choice data. Such data has been heavily studied in the economics literature.[3] We assume that the DM has a fixed utility function and satisficing level. In any given choice set, they search sequentially until they find an alternative which has utility above their satisficing level, at which point they stop and choose that alternative. If they search the entire choice set and do not find a satisficing alternative then they choose the best available option.
We assume that search order varies randomly, leading to stochasticity in choice. On the one hand, this chapter is related to the work of (Manzini and Mariotti 2014) (henceforth MM). It specifies a procedure by which attention is allocated, while MM is agnostic in this regard. On the other, it provides an alternative test of the satisficing model to that of (Caplin, Dean, and Martin 2011) and (Santos, Hortacsu, and Wildenbeest 2012), using a data set which is readily available in many settings.

[2] (Caplin, Dean, and Martin 2011) show that satisficing behavior can be optimal under some circumstances.
[3] See for example (Block and Marschak 1960, Luce and Suppes 1965, Falmagne 1978, Gul and Pesendorfer 2006, Gul, Natenzon, and Pesendorfer 2014, Manzini and Mariotti 2014).

Our main observation is that the satisficing model implies that choice is stochastic only in choice sets where there are multiple alternatives above the satisficing level. If this is the case, then the order of search will affect the chosen alternative. If not, then either the choice set will be fully searched and the best option deterministically chosen, or the single satisficing alternative will always be chosen. This allows us to behaviorally identify the alternatives that are satisficing for the decision maker.

Without further restriction, any stochastic choice data set can trivially be made commensurate with the satisficing choice procedure by assuming that all alternatives are above the satisficing level, and the resulting distribution of choices reflects the distribution of search orders in that choice set. In order to generate meaningful behavioral implications, we must place further restrictions on the satisficing model. For our main theorem we make the assumption that the distribution of search orders has a full support property (i.e., each item has a positive probability of being searched first), and also rule out the possibility of indifference.
This allows us to identify the set of above-reservation alternatives and characterize satisficing with two simple intuitive conditions. The first states that choice can be stochastic only amongst elements that are always chosen (with some probability) when available. The second says that revealed preference, defined via the support of the random choice rule in each set, must satisfy the Strong Axiom of Revealed Preference (SARP). Under these conditions, the data will admit a satisficing representation and the satisficing set, utility function and distribution over search orders can be identified to a high degree of precision.

Our baseline specification puts no restrictions on the relationship between the distributions of search orders across different choice sets. We next consider a refinement of the satisficing model in which the distribution of search order in each choice set is a manifestation of the same underlying search distribution. In order to guarantee such a representation we need a third axiom: the Total Monotonicity condition of (Block and Marschak 1960). This condition on its own is necessary and sufficient for the data to be commensurate with the random utility model (RUM). Thus, the fixed distribution satisficing model is the precise intersection between satisficing and random utility.

We discuss three extensions to our results in which we relax the assumptions of full support, no indifference and the observation of a complete data set. We show that a satisficing model without full support, but with fixed distribution, is equivalent to the random utility model. Allowing for indifference (but maintaining the full support assumption) is equivalent to dropping the requirement that stochasticity only take place amongst always chosen alternatives. If data is incomplete, our necessary and sufficient conditions are unchanged, but our ability to identify above satisficing elements is reduced.
We finish by studying extensions that allow for a random utility threshold and for random utility. The random utility threshold results in a model that is behaviorally equivalent to the baseline model with full support. We fully characterize the behavioral implications of the extension to random utility and show that it does not provide further information about the utility order among satisficing alternatives.

Section 2.2 describes our set up. Section 2.3 characterizes the satisficing model. Section 2.4 considers the extensions described above, while section 2.5 discusses the related literature.

2.2 Set Up

2.2.1 Data

We consider a finite abstract choice set X, and let D ⊆ 2^X \ ∅ be the set of menus in which behavior is observed. We assume that data comes in the form of a random choice rule, p : X × D → [0, 1], which specifies for each menu A ∈ D the probability of choosing each element a ∈ A (for example, if the DM has a one third probability of choosing x from {x, y, z} then p(x, {x, y, z}) = 1/3).

Definition 2.1 (Data set) A data set consists of a set of menus D ⊆ 2^X \ ∅ and a random choice rule p : X × D → [0, 1] such that $\sum_{a \in A} p(a, A) = 1$ for all A ∈ D. We say a data set is complete if D = 2^X \ ∅.

Random choice rules have been heavily studied in the theoretical, as well as the applied literature.[4] In practice, while a random choice rule is not directly observable, it can be estimated from observed choice frequencies, pooling either across repeated choices by the same individual, or by aggregating across the choices of different individuals.

2.2.2 The Satisficing Model

The satisficing choice procedure can be described as follows: when faced with a menu of options to choose from, the DM searches through the available alternatives one by one. If, at any point, they come across an alternative which is ‘good enough’, they stop searching and select that alternative from the menu.
If they exhaustively search all alternatives without finding an element which satisfies their criteria, then they choose the best available alternative from the set. Note that the standard model of rational choice is a limiting case of the satisficing model in which no alternative is ‘good enough’.

[4] Examples of early theoretical work include (Block and Marschak 1960) and (Luce and Suppes 1965). More recent work includes (Gul and Pesendorfer 2006, Manzini and Mariotti 2014, Gul, Natenzon, and Pesendorfer 2014).

As a concrete example, consider a DM searching for a book to buy in a bookshop prior to a flight. They examine the available books one by one, looking for one which satisfies their requirements (humorous, has good reviews, long enough to last the flight, not by Dan Brown). If they find such a book, they immediately go to the checkout and buy it. If they search the entire selection and don’t find a book which matches these criteria then they go back and choose the best of the books that they did see.

The satisficing choice procedure therefore has three building blocks. The first is a fixed utility function u : X → R, which describes the preferences of the DM. Following (Manzini and Mariotti 2014), for our main results we rule out indifference, and therefore assume that u is injective. We discuss the implications of allowing indifference in section 2.4.2.

The second model element is a utility threshold u*, which we will refer to as the reservation utility. This defines the concept of ‘good enough’: an alternative x ∈ X is good enough if u(x) ≥ u*. We define U* = {x ∈ X | u(x) ≥ u*} as the set of satisficing elements according to u and u*. For convenience, we will assume that there is at least one satisficing element: i.e. u* ≤ max_{x∈X} u(x). This assumption has no behavioral implication: a model in which only the best available alternative is above the reservation utility is indistinguishable from one in which there is no such alternative.
However, it will streamline the statement of identification results in section 2.3.

The third element of the satisficing model is the order in which search occurs. A search order for a choice set A is defined by a linear order on that set.[5] We use R_A to denote the set of linear orders on A, with r_A a typical element in R_A. Our key assumption is that the order of search is determined stochastically: we use γ_A : R_A → [0, 1] to denote the probability distribution over the set R_A, which we call a ‘stochastic search order’. Abusing notation slightly, we will use γ_A(x r_A y) to denote the probability of all search orders in which x appears before y: i.e. γ_A(r_A ∈ R_A | x r_A y).

[5] i.e. a complete, transitive and antisymmetric binary relation on A with the interpretation ‘searched no later than’.

We are agnostic about the source of this stochasticity. It could be that the DM randomly decides the order of search: in our example, sometimes the DM searches through the books alphabetically, while sometimes they do so by genre. Alternatively, it could be that the random choice rule is generated by a DM who is faced with choice situations which are framed in different ways,[6] with the framing unobservable to the researcher. For example, sometimes the bookstore puts the thrillers at the front of the store, while sometimes they put the romantic comedies at the front. These ‘frames’ affect the order in which the DM searches (though not their preferences), but are not known to the researcher.

[6] In the sense of (Salant and Rubinstein 2008).

A data set can be represented by the satisficing model if there exists a utility function, satisficing level and family of stochastic search orders which would generate the observed choice probabilities:

Definition 2.2 (General Satisficing Model (GSM)) A data set (D, p) has a General Satisficing Model (GSM) representation if there exists an injective u : X → R, u* ∈ R such
that u* ≤ max_{x∈X} u(x), and {γ_A}_{A∈D} such that, for any A ∈ D and a ∈ A
\[ p(a, A) = \begin{cases} \gamma_A(r_A \mid a \, r_A \, b \ \ \forall b \in A \setminus \{a\} \text{ s.t. } u(b) \geq u^*) & \text{if } u(a) \geq u^* \\ 1 & \text{if } a = \arg\max_{x \in A} u(x) \text{ and } u(a) < u^* \\ 0 & \text{otherwise} \end{cases} \qquad (2.1) \]

To illustrate how the model works consider the following example.

Example 2.3 Let A = {a, b, c} and γ_A be as displayed in Table 2.1. Consider first the case where a, b are satisficing alternatives, while c is not, that is u(a), u(b) > u* > u(c). Then, no matter the search order, c will never be chosen, and so p(c, A) = 0. However, as the DM will choose a if a is seen before b and b otherwise, their frequency depends on γ_A. In particular, p(a, A) = 3/8 and p(b, A) = 5/8. If instead all alternatives are below the satisficing level (i.e. u* > max_{x∈A} u(x)) then choice will be independent of search order: all alternatives will always be searched, and the best subsequently chosen.

Order   (1)    (2)    (3)    (4)    (5)    (6)
1st     a      a      b      b      c      c
2nd     b      c      a      c      a      b
3rd     c      b      c      a      b      a
γ_A     1/12   1/6    1/3    1/24   1/8    1/4

Table 2.1: An example of the satisficing model

2.3 Characterizing the Satisficing Model

2.3.1 A Negative Result

The aim of this chapter is to describe properties of a stochastic choice data set which are necessary and sufficient to guarantee a satisficing representation. However, our first result is negative: without further refinement, the GSM provides no restriction on such a data set.

Proposition 2.4 Any data set (D, p) has a GSM representation.

All proofs are relegated to the appendix.

The GSM is flexible enough to match any data set because it places no restriction on the distribution over search orders in each decision problem. Thus one can always construct a distribution of search orders that will match the data by assuming that all alternatives are above the satisficing level.

To derive testable restrictions for the satisficing model, we introduce a ‘full support’ condition on the distribution of search orders.
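The mapping from a stochastic search order to choice probabilities in equation (2.1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `satisficing_choice_probs`, the numerical utilities and the threshold values are assumptions made for the example (the text only requires u(a), u(b) > u* > u(c)); the search-order distribution is the one in Table 2.1.

```python
from fractions import Fraction

# Search-order distribution gamma_A from Table 2.1 (orders listed first to last).
GAMMA_A = {
    ("a", "b", "c"): Fraction(1, 12),
    ("a", "c", "b"): Fraction(1, 6),
    ("b", "a", "c"): Fraction(1, 3),
    ("b", "c", "a"): Fraction(1, 24),
    ("c", "a", "b"): Fraction(1, 8),
    ("c", "b", "a"): Fraction(1, 4),
}


def satisficing_choice_probs(menu, utility, threshold, search_dist):
    """Choice probabilities implied by the satisficing procedure (hypothetical helper)."""
    probs = {x: Fraction(0) for x in menu}
    for order, weight in search_dist.items():
        # Stop at the first alternative in the search order that is 'good enough'.
        chosen = next((x for x in order if utility[x] >= threshold), None)
        if chosen is None:
            # The whole menu was searched without success: take the best alternative.
            chosen = max(menu, key=utility.get)
        probs[chosen] += weight
    return probs


MENU = ("a", "b", "c")
# Assumed utility numbers; only their position relative to the threshold matters.
UTILITY = {"a": 3, "b": 2, "c": 1}

# Case 1 of Example 2.3: a and b are satisficing, c is not.
two_satisficing = satisficing_choice_probs(MENU, UTILITY, 1.5, GAMMA_A)
# Case 2 of Example 2.3: no alternative reaches the threshold.
none_satisficing = satisficing_choice_probs(MENU, UTILITY, 4, GAMMA_A)
```

Under these assumptions the first call reproduces p(a, A) = 3/8 and p(b, A) = 5/8 from Example 2.3, while the second assigns probability one to the utility-maximizing alternative regardless of the search order.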
The full support restriction will allow us to identify satisficing alternatives as those which are chosen with positive probability in every choice set in which they appear. We can then utilize the underlying structure of the GSM to derive behavioral restrictions. Intuitively, the stochastic nature of search generates stochastic behavior among satisficing alternatives. In contrast, we expect to observe deterministic utility maximizing behavior in choice sets which consist only of non-satisficing alternatives.

For the remainder of this section we concentrate on the simple case of full support, no indifference and complete data. We identify the behavioral conditions which characterize the resulting model. We also consider a special case of the model in which there is consistency in the distribution of search orders between choice sets. This additional restriction ensures that our model is behaviorally equivalent to a subset of the class of random utility models (RUMs). In section 2.4 we discuss extensions in which we drop the full support, no indifference, and complete data conditions.

2.3.2 Full Support Satisficing Models

Our model adds to the GSM the assumption that, in each choice set, any item will be searched first with some positive probability.

Assumption 2 (Full Support) For any a ∈ A and all A ∈ D: γ_A(r_A ∈ R_A : a r_A b ∀b ∈ A\{a}) > 0.

We describe a GSM which additionally satisfies this assumption as a Full Support Satisficing Model (FSSM).

Definition 2.5 (Full Support Satisficing Model (FSSM)) A data set (D, p) has a Full Support Satisficing Model (FSSM) representation if it has a GSM representation in which the stochastic search order satisfies Full Support.

The assumption of Full Support has an important implication: we can identify above-reservation alternatives as those which are always chosen with positive probability in any choice set in which they appear.
This is because Full Support implies that, for each such alternative, a search order in which it is searched first occurs with positive probability in each choice set, ensuring that it will be chosen. Furthermore, any alternatives that are not above reservation utility will be chosen with zero probability in any choice set which contains an above reservation utility alternative. We define the set of alternatives which are always chosen:

Definition 2.6 (Always Chosen Set) For any data set (D, p), we define the always chosen set as W* = {a ∈ X | p(a, A) > 0 for all A ∈ D such that a ∈ A}.

For any complete data set generated by a FSSM, W* must be equivalent to the set of above-reservation alternatives.

Lemma 2.7 Assume a complete data set (D, p) admits an FSSM representation. Then, for any such representation W* = U*.

As we discuss in section 2.4.3, if the data set is not complete then W* may be a strict superset of U*: a below satisficing alternative may always be chosen because it is only observed in choice sets containing other below satisficing alternatives.[7]

Using the (observable) set W*, we can define the first of two behavioral conditions which characterize the FSSM. It states that stochastic choice must only occur amongst elements of W*. This follows from the fact that stochasticity in the satisficing model occurs only from stochasticity in search order.

Axiom 2.8 (Deterministic no satisficing choice) If a ∈ X \ W* then either p(a, A) = 0 or p(a, A) = 1 for all A ∈ D.

The second condition ensures that the preference information revealed by a data set is well behaved. In order to state the condition, we introduce the following definitions.

Definition 2.9 (Stochastic Revealed Preference) Define C(A) = {a ∈ A | p(a, A) > 0}. a is stochastically revealed directly preferred to b if, for some A ∈ D, a, b ∈ A and a ∈ C(A).
a is stochastically revealed preferred to b if (a, b) is in the transitive closure of the stochastically revealed directly preferred relation. a is stochastically strictly revealed preferred to b if, for some A ∈ D, a ∈ C(A) and b ∉ C(A).

Notice that, for data generated by a FSSM, these revealed preference concepts align with the underlying utility function except in the case of two alternatives above the satisficing level. Such objects will be revealed indifferent to each other, yet may in fact be ranked according to the utility function. It is a defining feature of the satisficing model that utility differences above the threshold u* are unimportant for behavior. Nevertheless, the FSSM implies that the stochastic revealed preference information must obey the Strong Axiom of Revealed Preference.

[7] Completeness can be replaced by a weaker condition on the richness of the data set which requires observing choices from all two and three element sets.

Axiom 2.10 (SARP) C(A) must obey SARP: if a is stochastically revealed preferred to b then b must not be stochastically strictly revealed preferred to a.

Our first result is that Axiom 2.8 and Axiom 2.10 are necessary and sufficient for a data set to have a FSSM representation.

Theorem 2.11 The following are equivalent:
1. A stochastic choice dataset (D, p) is generated by a FSSM.
2. A stochastic choice dataset (D, p) satisfies Axiom 2.8 and Axiom 2.10.

To understand the sufficiency of the two axioms (Axiom 2.8 and Axiom 2.10), note first that SARP allows us, through Afriat/Richter’s theorem (Richter 1996), to construct a utility function which represents the stochastic revealed preference relation. Moreover, the elements of W* will be maximal according to that utility representation, allowing for a u* such that all elements of the always chosen set can be assigned a utility greater than or equal to u*, while all the elements that are not always chosen are assigned a utility level below u*.
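The two conditions of Theorem 2.11 depend only on observables, so they can be checked mechanically. The following Python sketch is one possible way to do so and is not taken from the source: the function names and the dictionary layout of a data set (menu mapped to choice probabilities) are assumptions. The two data sets are those of Examples 2.12 and 2.13, with the probabilities not stated there filled in so that each menu sums to one.

```python
from fractions import Fraction


def always_chosen_set(data):
    """W* of Definition 2.6: chosen with positive probability whenever available."""
    alternatives = set().union(*data)
    return {a for a in alternatives
            if all(p[a] > 0 for menu, p in data.items() if a in menu)}


def satisfies_axiom_2_8(data):
    """Alternatives outside W* must be chosen with probability 0 or 1."""
    w_star = always_chosen_set(data)
    return all(p[a] in (0, 1)
               for menu, p in data.items() for a in menu if a not in w_star)


def satisfies_sarp(data):
    """Axiom 2.10 on the supports C(A) of the random choice rule."""
    alternatives = set().union(*data)
    support = {menu: {a for a in menu if p[a] > 0} for menu, p in data.items()}
    weak = {(a, b) for menu, c in support.items() for a in c for b in menu}
    strict = {(a, b) for menu, c in support.items() for a in c for b in menu - c}
    # Transitive closure of the directly revealed preferred relation (Warshall).
    for k in alternatives:
        for i in alternatives:
            for j in alternatives:
                if (i, k) in weak and (k, j) in weak:
                    weak.add((i, j))
    return not any((b, a) in strict for (a, b) in weak)


half = Fraction(1, 2)
# Data of Example 2.12 (menus as frozensets, probabilities per alternative).
example_2_12 = {
    frozenset("ab"): {"a": 1, "b": 0},
    frozenset("bc"): {"b": half, "c": half},
    frozenset("ac"): {"a": 1, "c": 0},
    frozenset("abc"): {"a": 1, "b": 0, "c": 0},
}
# Data of Example 2.13.
example_2_13 = {
    frozenset("ab"): {"a": 1, "b": 0},
    frozenset("bc"): {"b": 1, "c": 0},
    frozenset("ac"): {"a": 0, "c": 1},
    frozenset("abc"): {"a": 1, "b": 0, "c": 0},
}
```

With this layout, `satisfies_axiom_2_8(example_2_12)` is false while `satisfies_sarp(example_2_12)` is true, and the reverse holds for `example_2_13`, in line with what the two examples are meant to illustrate.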
Axiom 2.8 guarantees deterministic choice in sets which contain at most one above-reservation alternative, and SARP again ensures that such choices are utility maximizing. For all other choice sets, Axiom 2.8 ensures that alternatives with utility below u* (and so outside W*) are not chosen, and a suitable stochastic search order can be constructed from the random choice rules to explain the pattern of choice amongst above satisficing alternatives.

Notice that, while Lemma 2.7 relies on the completeness of the data set, Theorem 2.11 does not. The behavioral content of the FSSM is the same regardless of whether the data set is complete. However, the degree to which elements of the representation can be identified will be reduced in incomplete data sets, as we discuss in section 2.4.3.

The following examples illustrate the empirical content of the FSSM by presenting data sets which violate each of our axioms.

Example 2.12 (Violation of Axiom 2.8) Let X = {a, b, c}, and let p(a, {a, b}) = 1, p(b, {b, c}) = 1/2, p(a, {a, c}) = 1 and p(a, {a, b, c}) = 1. This does not satisfy Axiom 2.8 since W* = {a}, but p(b, {b, c}) = 1/2 ∉ {0, 1}. This behavior is incommensurate with the FSSM because the fact that b is chosen probabilistically from {b, c} indicates that it must be above the satisficing level, yet this means that it should be chosen some of the time from {a, b, c} due to the full support assumption.

Example 2.13 (Violation of Axiom 2.10) Let X = {a, b, c}, and let p(a, {a, b}) = 1, p(b, {b, c}) = 1, p(a, {a, c}) = 0 and p(a, {a, b, c}) = 1. This does not satisfy Axiom 2.10 since p(a, {a, b}) = 1 means that a is stochastically strictly revealed preferred to b, p(b, {b, c}) = 1 means that b is stochastically revealed preferred to c, while p(c, {a, c}) = 1 means that c is stochastically strictly revealed preferred to a.
Such behavior is also incommensurate with the FSSM as, in each case, the uniquely chosen object must have a utility strictly higher than those which are not chosen, either because all are below the satisficing level, in which case the best option is chosen, or because only the chosen object is above the satisficing level.

Theorem 2.11 shows the extent to which the FSSM can be tested and differentiated from other models. First, note that any data set in which p(a, A) > 0 for all a ∈ A and all A ∈ D will trivially satisfy both Axiom 2.8 and Axiom 2.10, and so admit an FSSM representation. This is because any data set in which the random choice rule has full support in every choice set can be rationalized by an FSSM in which every alternative is above the satisficing level, and the resulting pattern of choice is driven by the choice-set specific distribution over search orders. Second, note that the standard model of utility maximization (without indifference) is a limit case of the FSSM in which |W*| = 1. Third, notice that an alternative interpretation of the FSSM is a model in which attention is complete and choices are governed by a preference relation which has indifference only amongst maximal elements. A model in which such indifference is resolved using a random tie breaking rule with full support amongst maximal elements is equivalent to the FSSM.

Recoverability in the FSSM

In the case of a complete data set which satisfies Axiom 2.8 and Axiom 2.10, many of the elements of the FSSM can be uniquely identified.

Theorem 2.14 Let (D, p) be a complete data set generated by an FSSM $(u, u^*, \{\gamma_A\}_{A \in D})$. For any FSSM representation of the data $(\bar{u}, \bar{u}^*, \{\bar{\gamma}_A\}_{A \in D})$:
1. $U^* = \bar{U}^*$
2. For all $a, b \notin U^*$, $u(a) > u(b) \Rightarrow \bar{u}(a) > \bar{u}(b)$
3.
$\gamma_A(a \, r_A \, b \ \forall b \in \{A \cap U^*\} \setminus \{a\}) = \bar{\gamma}_A(a \, r_A \, b \ \forall b \in \{A \cap U^*\} \setminus \{a\})$ for all $A \in D$, $a \in A \cap U^*$

Theorem 2.14 tells us that, in a complete data set, we can uniquely identify the above-satisficing elements, the preference ordering over non-satisficing elements, and the probability that one satisficing element will be seen before another in any choice set.

2.3.3 Fixed Distribution Satisficing Models

So far, we have allowed stochastic search order to vary arbitrarily between choice sets: an alternative that is likely to be searched first in choice set A may be very unlikely to be searched first in choice set B. However, in some cases such an assumption may be inappropriate. For example, consider the case in which the probability of search is governed by the ‘salience’ of different alternatives: a book with a bright pink cover may be more likely to be looked at before one with a dark brown cover regardless of the set of available alternatives.

We now consider the implications of a satisficing model with full support in which the probability distribution over search orders is invariant to the set of available alternatives. We call this the ‘fixed distribution’ property.

Assumption 3 (Fixed Distribution) There exists a Γ_X : R_X → [0, 1] such that, for every A ∈ D and r_A ∈ R_A, γ_A(r_A) = Γ_X(r_X | r_A ⊂ r_X).

For every choice set A, it is as if the DM draws a search order from a distribution Γ_X over linear orders on the grand set of alternatives X. They then follow that search order, ignoring any alternatives that are not in fact available in A.

Definition 2.15 (Fixed Distribution Satisficing Model (FDSM)) A data set (D, p) has a Fixed Distribution Satisficing Model (FDSM) representation if it has a FSSM representation in which the family of stochastic search orders {γ_A}_{A∈D} satisfies Fixed Distribution.

The conditions Axiom 2.8 and Axiom 2.10 are necessary for FDSM but not sufficient, implying that the FDSM is a strict subcase of FSSM.
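The Fixed Distribution assumption says that each menu-specific distribution γ_A is obtained by restricting draws from Γ_X to the alternatives in A. A minimal Python sketch of that restriction step follows. It is illustrative only: the helper name `induced_search_dist` is hypothetical, and reusing the distribution of Table 2.1 as a grand distribution Γ_X on X = {a, b, c} is an assumption made purely to have numbers to work with.

```python
from fractions import Fraction

# Assumed grand distribution Gamma_X over linear orders on X = {a, b, c},
# borrowed from Table 2.1 for illustration.
GAMMA_X = {
    ("a", "b", "c"): Fraction(1, 12),
    ("a", "c", "b"): Fraction(1, 6),
    ("b", "a", "c"): Fraction(1, 3),
    ("b", "c", "a"): Fraction(1, 24),
    ("c", "a", "b"): Fraction(1, 8),
    ("c", "b", "a"): Fraction(1, 4),
}


def induced_search_dist(grand_dist, menu):
    """gamma_A implied by Gamma_X: drop unavailable items, keep relative order."""
    induced = {}
    for grand_order, weight in grand_dist.items():
        restricted = tuple(x for x in grand_order if x in menu)
        induced[restricted] = induced.get(restricted, Fraction(0)) + weight
    return induced


gamma_ab = induced_search_dist(GAMMA_X, {"a", "b"})
gamma_bc = induced_search_dist(GAMMA_X, {"b", "c"})
```

Because every γ_A is computed from the same Γ_X, the probability that one alternative is searched before another is the same in every menu that contains both, which is the cross-menu consistency that the FSSM alone does not impose.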
In order to obtain sufficiency, we make use of the Total Monotonicity condition of (Block and Marschak 1960). Total Monotonicity by itself is a sufficient and necessary condition for Random Utility Maximization in our environment. This implies that the FDSM is the exact intersection of the FSSM with the Random Utility model of Block-Marschak and (Falmagne 1978).

In order to define the total monotonicity condition, we first need to define the following function for each A ∈ D and a ∈ A:
\[ f(a, A) = \sum_{D \in B(A)} (-1)^{|D \setminus A|} \, p(a, D) \]
where B(A) is the class of supersets of A (i.e., B(A) ≡ {D ∈ D | A ⊆ D}).

(Block and Marschak 1960) and (Falmagne 1978) proved that the following behavioral axiom, called Total Monotonicity (or Block-Marschak Monotonicity), is necessary and sufficient for a RUM representation:

Axiom 2.16 (Total Monotonicity) f(a, B) ≥ 0 for all a ∈ X, for all B ∈ D.

Observe that f depends only on the data set, so this axiom is testable. Note that Total Monotonicity implies ‘standard’ monotonicity: the probability of choosing any given alternative falls as more alternatives are added to the choice set, that is p(a, A) ≥ p(a, B) when A ⊆ B.[8] However, it is also stronger than this condition, as we can see in the following example: set X = {a, b, c, d}, let p(a, {a, b}) = 0.2, p(a, {a, b, c}) = p(a, {a, b, d}) = 0.19 and p(a, {a, b, c, d}) = 0.17. We check f(a, {a, b}) = p(a, {a, b}) + p(a, {a, b, c, d}) − [p(a, {a, b, c}) + p(a, {a, b, d})] and observe that f(a, {a, b}) = −0.01, which is negative and violates total monotonicity. However, we can see that standard monotonicity holds in this example.

[8] To see this take the Mobius inverse representation $p(a, A) = \sum_{D \in B(A)} f(a, D)$ and $p(a, B) = \sum_{D' \in B(B)} f(a, D')$ and note that if D ∈ B(B) then D ∈ B(A) and f(a, D) ≥ 0 by total monotonicity; then we have that p(a, A) ≥ p(a, B).

Clearly, a FDSM cannot lead to a failure of regularity.
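The arithmetic of the Total Monotonicity example above can be reproduced with a short Python sketch. The function names and the dictionary layout are assumptions for illustration; only the probabilities of choosing a from the four menus containing {a, b} are taken from the example, since those are the only ones it specifies.

```python
from fractions import Fraction

# p(a, D) for the menus D that contain {a, b}, as given in the example.
P = {
    frozenset("ab"): {"a": Fraction(20, 100)},
    frozenset("abc"): {"a": Fraction(19, 100)},
    frozenset("abd"): {"a": Fraction(19, 100)},
    frozenset("abcd"): {"a": Fraction(17, 100)},
}


def block_marschak(a, menu, p):
    """f(a, A): alternating sum of p(a, D) over the observed supersets D of A."""
    menu = frozenset(menu)
    return sum((-1) ** len(d - menu) * probs[a]
               for d, probs in p.items() if menu <= d)


def is_regular(a, p):
    """Standard monotonicity: p(a, A) >= p(a, B) whenever A is a subset of B."""
    return all(p[small][a] >= p[big][a]
               for small in p for big in p if small <= big)


f_a_ab = block_marschak("a", "ab", P)
regular = is_regular("a", P)
```

Exact fractions are used instead of floats so that the sign of f(a, {a, b}) is not affected by rounding; the sketch returns −1/100 for `f_a_ab` while `regular` is true, matching the claim that regularity holds but Total Monotonicity fails.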
If an FDSM did lead to a failure of regularity, that would mean that a given satisficing item is more likely to be found first in a bigger menu than in a smaller one, which is not consistent with the idea that the probability of any search order is fixed across menus. The higher order monotonicity conditions implied by total monotonicity can be interpreted as saying that the likelihood of a satisficing item being found first decreases with the size of the menu, but the marginal effect of adding a new option to the menu decreases with its size. In the example above, the requirement f(a, {a, b}) ≥ 0 means that the impact of adding one additional item c to the menu {a, b} on the probability of choosing a, i.e. p(a, {a, b}) − p(a, {a, b, c}) (0.01 in the example), is bigger than the impact of adding the same item c to the bigger menu {a, b, d}, i.e. p(a, {a, b, d}) − p(a, {a, b, c, d}) (0.02 in the example), which fails here.

We are ready to state the main result of this section.

Theorem 2.17 The following are equivalent:
1. A complete stochastic choice dataset (D, p) is generated by a FDSM.
2. A complete stochastic choice dataset (D, p) satisfies Axiom 2.8, Axiom 2.10 and Total Monotonicity (Axiom 2.16).

Note that, unlike Theorem 2.11, Theorem 2.17 requires a complete data set.

Recoverability in the FDSM

In the case of a complete data set which satisfies Axiom 2.8, Axiom 2.10 and Total Monotonicity (Axiom 2.16), several of the elements of the FDSM can be identified. In particular, the identification of the search orderings improves upon the FSSM recoverability.

Theorem 2.18 Let (D, p) be a complete data set generated by an FDSM $(u, u^*, \Gamma_X)$ such that $X \setminus U^* \neq \emptyset$. For any FDSM representation of the data $(\bar{u}, \bar{u}^*, \bar{\Gamma}_X)$:
1. $U^* = \bar{U}^*$
2. For all $a, b \notin U^*$, $u(a) > u(b) \Rightarrow \bar{u}(a) > \bar{u}(b)$
3. $\Gamma_X(a \, r_X \, b) = \bar{\Gamma}_X(a \, r_X \, b)$ for all $A \in D$ and $a, b \in U^*$

Theorem 2.18 tells us that, in a complete data set, we can identify the above-satisficing elements, granted that there is a positive probability of being seen before some above-satisficing element.
Moreover, we can identify the preference ordering over the alternatives that are surely non-satisficing and, for those elements that are surely satisficing, the probability that one revealed satisficing element will be seen before another in any choice set.

2.3.4 Comparative Statics

In this section we study the comparative statics of the model with respect to its primitives. We first study the behavioral implications of changes in the utility threshold. If we observe two decision makers, $I$ and $\tilde{I}$, that face the same menus and are such that $u^*_I > u^*_{\tilde{I}}$ and identical otherwise, then we expect to observe that the set of always satisficing alternatives for decision maker $I$ is weakly smaller than, and a subset of, the set of always satisficing alternatives for decision maker $\tilde{I}$. Intuitively, lower utility thresholds are associated with a bigger set of satisficing alternatives and, therefore, a bigger set of alternatives that are always chosen when present in a menu. The following claim formalizes this intuition.

Claim 2.19 Let $I = (u, u^*, \{\gamma_A\}_{A \in \mathcal{D}})$ and $\tilde{I} = (u, \tilde{u}^*, \{\gamma_A\}_{A \in \mathcal{D}})$ be two FSSM (FDSM) decision makers. Then:

1. If $u^* > \tilde{u}^*$ then $|W^*| \leq |\tilde{W}^*|$

2. If $|W^*| < |\tilde{W}^*|$ then $u^* > \tilde{u}^*$

To study the effect of the utility threshold on the well being of DMs, we first introduce some definitions.

Definition 2.20 (Utility Distribution) For a FSSM (FDSM) decision maker $I = (u, u^*, \{\gamma_A\}_{A \in \mathcal{D}})$ we define the utility distribution for menu $A$ as the distribution over utility levels implied by the model,

\[ p^I(\tilde{u}, A) = \sum_{a \in A} \mathbf{1}\{u(a) = \tilde{u}\} \, p(a, A), \]

and the cumulative distribution function

\[ F^I(u, A) = \sum_{\tilde{u} \leq u} p^I(\tilde{u}, A). \]

Definition 2.21 (Expected Utility) The expected utility for a FSSM (FDSM) decision maker $I = (u, u^*, \{\gamma_A\}_{A \in \mathcal{D}})$ when choosing from menu $A$ is given by

\[ Eu^I(A) = \sum_{a \in A} p^I(u(a), A) \, u(a). \]

We show that a FSSM (FDSM) decision maker with a lower utility threshold makes worse decisions.
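Definitions 2.20 and 2.21 can be made concrete with a short computation. The sketch below is illustrative only: it assumes that the search-order probabilities are those in the $\gamma_A$ row of Table 2.2, and it uses the utilities and the two thresholds of Example 2.22. The function names (`choice_probs`, `cdf`, `expected_utility`) are hypothetical, and an item is treated as satisficing when its utility is at least the threshold.

```python
from fractions import Fraction as Fr


def choice_probs(menu, u, threshold, gamma):
    """Satisficing choice probabilities on one menu.

    gamma maps each search order (a tuple listing the menu) to its probability.
    """
    satisficing = {x for x in menu if u[x] >= threshold}
    probs = {x: Fr(0) for x in menu}
    if not satisficing:
        # Nothing clears the threshold: the DM searches everything and picks the best.
        probs[max(menu, key=u.get)] = Fr(1)
        return probs
    for order, weight in gamma.items():
        # The first satisficing item met along the search order is chosen.
        first = next(x for x in order if x in satisficing)
        probs[first] += weight
    return probs


def cdf(probs, u, level):
    """F(level, A): probability that the chosen item has utility <= level."""
    return sum((q for x, q in probs.items() if u[x] <= level), Fr(0))


def expected_utility(probs, u):
    """Eu(A): probability-weighted utility of the chosen item."""
    return sum(q * u[x] for x, q in probs.items())


A = ("a", "b", "c")
u = {"a": 1, "b": 2, "c": 3}
gamma_A = {  # assumed: the gamma_A row of Table 2.2
    ("a", "b", "c"): Fr(1, 12),
    ("a", "c", "b"): Fr(1, 6),
    ("b", "a", "c"): Fr(1, 3),
    ("b", "c", "a"): Fr(1, 24),
    ("c", "a", "b"): Fr(1, 8),
    ("c", "b", "a"): Fr(1, 4),
}

high = choice_probs(A, u, Fr(5, 2), gamma_A)  # threshold 2.5
low = choice_probs(A, u, Fr(3, 2), gamma_A)   # threshold 1.5

print(expected_utility(high, u), expected_utility(low, u))  # 3 61/24
```

Under these assumptions the lower-threshold decision maker chooses $b$ with probability 11/24 and $c$ with probability 13/24, so her utility distribution is first-order stochastically dominated by that of the higher-threshold decision maker.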
To understand why a lower threshold leads to worse decisions, consider two FSSM decision makers $I = (u, u^*, \{\gamma_A\}_{A \in \mathcal{D}})$ and $\tilde{I} = (u, \tilde{u}^*, \{\gamma_A\}_{A \in \mathcal{D}})$ with $u^* > \tilde{u}^*$. Then, for any $A \in \mathcal{D}$, $|A \cap U^*| \leq |A \cap \tilde{U}^*|$. Therefore a DM with a lower threshold randomizes over a set that includes alternatives with a lower utility level whenever satisficing alternatives are available in the menu. The following example illustrates how this mechanism works.

Example 2.22 Let $A = \{a, b, c\}$ and for simplicity assume that $u(a) = 1$, $u(b) = 2$ and $u(c) = 3$. Let $u^* = 2.5$ and $\tilde{u}^* = 1.5$, and consider the search orders from Example 2.3. For decision maker $I$, $p(c, A) = 1$ and $p(a, A) = p(b, A) = 0$,

\[ F^I(u, A) = \begin{cases} 0 & \text{if } u < 3 \\ 1 & \text{if } u \geq 3 \end{cases} \]

and $Eu^I(A) = 3$. For decision maker $\tilde{I}$, $p(a, A) = 0$, $p(b, A) = \frac{11}{24}$ and $p(c, A) = \frac{13}{24}$, therefore

\[ F^{\tilde{I}}(u, A) = \begin{cases} 0 & \text{if } u < 2 \\ \frac{11}{24} & \text{if } 2 \leq u < 3 \\ 1 & \text{if } u \geq 3 \end{cases} \]

and $Eu^{\tilde{I}}(A) = \frac{61}{24} < 3$.

This result is formalized in Proposition 2.23.

Proposition 2.23 (Utility Threshold) Let $I = (u, u^*, \{\gamma_A\}_{A \in \mathcal{D}})$ and $\tilde{I} = (u, \tilde{u}^*, \{\gamma_A\}_{A \in \mathcal{D}})$ be two FSSM, with cumulative probability distributions over utility levels $F^I(u, A)$ and $F^{\tilde{I}}(u, A)$. If $u^* > \tilde{u}^*$ then $F^I(u, A) \leq F^{\tilde{I}}(u, A)$ for all $u$. Moreover, if $F^I(u, A) \leq F^{\tilde{I}}(u, A)$ with $F^I(u, A) \neq F^{\tilde{I}}(u, A)$ then $u^* > \tilde{u}^*$.

Corollary 2.24 Under the conditions of Proposition 2.23, if $u^* \geq \tilde{u}^*$ then $Eu^I(A) \geq Eu^{\tilde{I}}(A)$ for all $A \in \mathcal{D}$. Moreover, if $Eu^I(A) > Eu^{\tilde{I}}(A)$ then $u^* > \tilde{u}^*$.

To analyze the overall cost in terms of utility for a decision maker that utilizes a lower utility threshold we note that, from Corollary 2.24, if $u^* \geq \tilde{u}^*$ then

\[ \sum_{A \in \mathcal{D}} Eu^I(A) \geq \sum_{A \in \mathcal{D}} Eu^{\tilde{I}}(A), \]

and the last inequality is strict whenever $\{a \in A : \tilde{u}^* \leq u(a) < u^* \text{ for some } A \in \mathcal{D}\} \neq \emptyset$.

Corollary 2.25 Under the conditions of Proposition 2.23 and assuming complete data sets, if $u^* \geq \tilde{u}^*$ then $|U^*| \leq |\tilde{U}^*|$. Moreover, if $\{a \in A : \tilde{u}^* \leq u(a) < u^* \text{ for some } A \in \mathcal{D}\} \neq \emptyset$ then $|U^*| < |\tilde{U}^*|$.

A direct implication of Corollary 2.25 is that, whenever a satisficing decision maker lowers her utility threshold, the relative size of the set of irrational choices, $\frac{|U^*|}{|X|}$, increases.

Alternatively, we can study the effect of changes in the search orders. Intuitively, if the relative probability of search orders changes in such a way that search orders congruent with preference orders become more likely, then the expected utility of the DM's decisions improves. The following example illustrates how this mechanism works.

Example 2.26 Consider again example 1 with $u(a), u(b) > u^* > u(c)$ and search orders $\gamma_A$ and $\tilde{\gamma}_A$ as shown in Table 2.2, where $\tilde{\gamma}_A((a, b, c)) = \gamma_A((a, b, c)) + \frac{1}{12}$ and $\tilde{\gamma}_A((b, a, c)) = \gamma_A((b, a, c)) - \frac{1}{12}$. Note that for DM $I$ we have that $p^I(a, A) = \frac{3}{8}$, $p^I(b, A) = \frac{5}{8}$ and $p^I(c, A) = 0$; while for DM $\tilde{I}$ we have that $p^{\tilde{I}}(a, A) = \frac{11}{24}$, $p^{\tilde{I}}(b, A) = \frac{13}{24}$ and $p^{\tilde{I}}(c, A) = 0$. For simplicity assume that $u(a) = 3$ and $u(b) = 2$. Correspondingly, the cumulative distribution functions are given by

\[ F^I(u, A) = \begin{cases} 0 & \text{if } u < 2 \\ \frac{5}{8} & \text{if } 2 \leq u < 3 \\ 1 & \text{if } u \geq 3 \end{cases} \]

with $Eu^I(A) = \frac{19}{8}$, and

\[ F^{\tilde{I}}(u, A) = \begin{cases} 0 & \text{if } u < 2 \\ \frac{13}{24} & \text{if } 2 \leq u < 3 \\ 1 & \text{if } u \geq 3 \end{cases} \]

with $Eu^{\tilde{I}}(A) = \frac{59}{24} > \frac{19}{8}$.

Order                (1)    (2)    (3)    (4)    (5)    (6)
1st                  a      a      b      b      c      c
2nd                  b      c      a      c      a      b
3rd                  c      b      c      a      b      a
$\gamma_A$           1/12   1/6    1/3    1/24   1/8    1/4
$\tilde{\gamma}_A$   1/6    1/6    1/4    1/24   1/8    1/4

Table 2.2: Changes in the probability of search orders and its effect on DM's well-being.

This result is formalized in Appendix 2.A.9.

2.4 Extensions

In this section we extend our model by relaxing in turn the assumptions of full support, no indifference, and complete data.

2.4.1 Fixed Distributions Without Full Support

As discussed in section 2.3.1, the GSM model is vacuous without the full support assumption. Here we consider the empirical implication of dropping full support but maintaining the fixed distribution assumption.
Without full support, the identification of satisficing elements as those that are always chosen breaks down. A satisficing element $a$ may not be chosen in some sets if it is always searched after another satisficing element $b$. To illustrate this point consider the following example.

Example 2.27 Let $X = \{a, b, c, d\}$, $U^* = \{a, b\}$ and let $\Gamma$ be a distribution over search orders on $X$ with full support, where each possible search order is equally likely; moreover assume that $u(c) > u(d)$. Then, for any menu such that $U^* \subseteq A$, $p(a, A) = p(b, A) = \frac{1}{2}$, and if $A \cap U^* \neq \emptyset$, then $p(U^*, A) = 1$ and $p(c, A) = p(d, A) = 0$. Finally, $p(c, \{c, d\}) = 1$ and $p(d, \{c, d\}) = 0$. Note that this is the standard case described in section 2.3.3. Now, notice that since we do not assume that $\Gamma$ needs to have full support on the set of search orders on $X$, the same data set can be generated by the following fixed distribution satisficing model without full support: $\bar{U}^* = X$, $\bar{\Gamma}((a, b, c, d)) = \bar{\Gamma}((b, a, c, d)) = \frac{1}{2}$ and $\bar{\Gamma}(r_X) = 0$ for every linear order $r_X$ on $X$ such that $r_X \notin \{(a, b, c, d), (b, a, c, d)\}$. Furthermore, this alternative representation is not unique.

Because it is not possible to identify the satisficing alternatives, the only implication of the satisficing model without full support, but with fixed distribution, is Total Monotonicity. In other words, it is behaviorally indistinguishable from the random utility model. This can be seen by noting that a RUM can be reinterpreted as a satisficing model with fixed distribution by assuming that all alternatives are above the reservation level, and treating the preference orderings from the random utility model as search orders in the satisficing model.

Proposition 2.28 The following are equivalent:

1. A complete stochastic choice dataset $(\mathcal{D}, p)$ is generated by a FDSM without Full Support.

2. A complete stochastic choice dataset $(\mathcal{D}, p)$ satisfies Axiom 2.16.
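The observational equivalence claimed in Example 2.27 can be verified by brute force over all menus. The following is a minimal sketch under stated assumptions: the numerical utility values are illustrative (only $u(c) > u(d)$ comes from the example), and the function name `fdsm_probs` is hypothetical.

```python
from fractions import Fraction as Fr
from itertools import combinations, permutations


def fdsm_probs(menu, satisficing, u, Gamma):
    """Choice probabilities on `menu` under a single distribution Gamma over orders of X."""
    live = [x for x in menu if x in satisficing]
    probs = {x: Fr(0) for x in menu}
    if not live:
        # No satisficing item in the menu: the best available item is chosen.
        probs[max(menu, key=u.get)] = Fr(1)
        return probs
    for order, weight in Gamma.items():
        # The order on X induces the order on the menu; pick the first satisficing item.
        first = next(x for x in order if x in live)
        probs[first] += weight
    return probs


X = ("a", "b", "c", "d")
u = {"a": 4, "b": 3, "c": 2, "d": 1}  # illustrative values; only u(c) > u(d) is from the text

# Model 1: U* = {a, b}, every search order on X equally likely (full support).
uniform = {order: Fr(1, 24) for order in permutations(X)}
# Model 2: every item satisficing, only two search orders have positive probability.
two_orders = {("a", "b", "c", "d"): Fr(1, 2), ("b", "a", "c", "d"): Fr(1, 2)}

menus = [m for k in range(1, 5) for m in combinations(X, k)]
same_data = all(
    fdsm_probs(m, {"a", "b"}, u, uniform) == fdsm_probs(m, set(X), u, two_orders)
    for m in menus
)
print(same_data)  # True
```

Both parameterizations generate identical choice probabilities on every menu, which is why the satisficing set cannot be recovered once full support is dropped.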
2.4.2 Allowing for Indifference

Here we relax the no indifference assumption while keeping the Full Support condition. Allowing for indifference potentially introduces stochasticity among non-satisficing alternatives due to the DM's rule to break ties. We assume that tie breaking works as follows: if the DM is indifferent between two or more alternatives, and needs to choose one of them, she chooses at random from the set of indifferent alternatives with probabilities induced by the tie-breaking rule $T$.

Definition 2.29 (Tie-breaking rule) Let $T : X \times \mathcal{D} \to \mathbb{R}_{++}$⁹ be a function that assigns tie breaking weights to alternatives. In case of indifference between two or more alternatives in menu $A$, the DM applies the induced tie breaking rule as follows:

\[ T(a \mid A_{\sim a}) = \frac{T(a, A)}{\sum_{b \in A_{\sim a}} T(b, A)} \tag{2.2} \]

where $T(a \mid A_{\sim a}) > 0$ is the always positive probability that $a$ is chosen when $a$ is indifferent to all the elements in the set $A_{\sim a} \equiv \{b \in A : u(b) = u(a)\}$, and superior to all other elements in $A$ (i.e. $u(a) \geq u(b)$ for all $b \in A$). Note that if $|A_{\sim a}| = 1$ then $T(a \mid A_{\sim a}) = 1$, and that $\sum_{b \in A_{\sim a}} T(b \mid A_{\sim a}) = 1$ in general.

⁹ Note that we rule out deterministic tie breaking since the behavior of a DM that is indifferent between two alternatives $a$ and $b$ and always chooses $a$ over $b$ is behaviorally indistinguishable from that of a DM that prefers $a$ over $b$.

We now extend the Full Support Search Model to allow for indifferences.

Definition 2.30 (Full Support Search Model with Indifferences (FSSMI)) A data set $(\mathcal{D}, p)$ has a Full Support Search Model with Indifferences (FSSMI) representation if there exist $u : X \to \mathbb{R}$, $u^* \in \mathbb{R}$ such that $u^* \leq \max_{a \in X} u(a)$, stochastic search orders $\{\gamma_A\}_{A \in \mathcal{D}}$ that satisfy Full Support and a tie breaking rule $T : X \times \mathcal{D} \to \mathbb{R}_{++}$, such that, for any $a \in A$,

\[ p(a, A) = \begin{cases} \gamma_A\big(a \, r_A \, b \ \ \forall b \in A \setminus \{a\} \text{ s.t. } u(b) \geq u^*\big) & \text{if } u(a) \geq u^* \\ T(a \mid A_{\sim a}) & \text{if } a \in \arg\max_{x \in A} u(x) \text{ and } \max_{x \in A} u(x) < u^* \\ 0 & \text{otherwise} \end{cases} \tag{2.3} \]

The following example illustrates the FSSMI.
Example 2.31 Consider again Example 2.3, and assume that the DM's choices can be represented by a Full Support Search Model with Indifferences. Let $U^* = \{a\}$, and let $u(b) = u(c)$. Then $p(a, A) = 1$ and $p(b, A) = p(c, A) = 0$ for all $A$ such that $a \in A$. Let the tie breaking rule be generated by $T(b, \{b, c\}) = 1$ and $T(c, \{b, c\}) = 2$; then $p(b, \{b, c\}) = \frac{1}{3}$ and $p(c, \{b, c\}) = \frac{2}{3}$.

Axiom 2.8 is no longer necessary for the FSSMI model: stochasticity can occur amongst alternatives that are not always chosen due to indifference. In fact, it turns out that the behavioral implication of allowing for indifference is precisely the removal of this axiom from our set of necessary and sufficient conditions.

Theorem 2.32 The following are equivalent:

1. A complete stochastic choice dataset $(\mathcal{D}, p)$ is generated by a FSSMI.

2. A complete stochastic choice dataset $(\mathcal{D}, p)$ satisfies Axiom 2.10.

Theorem 2.32 highlights that the satisficing model without indifference can be reinterpreted as a standard optimizing model with random tie breaking, but allowing for indifference only amongst the best alternatives.

Recoverability in the FSSMI

Given $u^* \leq \max_{a \in X} u(a)$, the extension of the model to allow for ties does not obscure the identification result for the satisficing set as in 2.6, where the always chosen set coincides with the satisficing set, i.e. $W^* = U^*$. The following theorem describes the degree to which the other elements of the model can be identified.

Theorem 2.33 Let $(\mathcal{D}, p)$ be a complete data set generated by an FSSMI $(u, u^*, \{\gamma_A\}_{A \in \mathcal{D}}, T)$. For any FSSMI representation of the data $(\bar{u}, \bar{u}^*, \{\bar{\gamma}_A\}_{A \in \mathcal{D}}, \bar{T})$:

1. $U^* = \bar{U}^*$

2. For all $a, b \notin U^*$, $u(a) \geq u(b) \Rightarrow \bar{u}(a) \geq \bar{u}(b)$

3. $\gamma_A(a \, r_A \, b \ \forall b \in \{A \cap U^*\} \setminus \{a\}) = \bar{\gamma}_A(a \, r_A \, b \ \forall b \in \{A \cap U^*\} \setminus \{a\})$ for all $A \in \mathcal{D}$, $a \in A \cap U^*$

4. $T(a \mid A_{\sim a}) = \bar{T}(a \mid A_{\sim a})$ for all $a \in A$, $A \in \mathcal{D}$.
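To make the tie-breaking rule of Definition 2.29 concrete, the sketch below reproduces the numbers of Example 2.31 for the menu $\{b, c\}$, which contains no satisficing item. It is an illustration only: the function name `tie_break_probs` and the common utility level assigned to $b$ and $c$ are assumptions. It also shows that rescaling the weights leaves the induced probabilities unchanged, which is consistent with item 4 of Theorem 2.33 pinning down the induced probabilities $T(a \mid A_{\sim a})$ rather than the raw weights.

```python
from fractions import Fraction as Fr


def tie_break_probs(menu, u, T):
    """Choice probabilities on a menu with no satisficing item, following equation (2.2)."""
    best = max(u[x] for x in menu)
    tied = [x for x in menu if u[x] == best]  # the top-ranked indifference class
    total = sum(T[(x, menu)] for x in tied)
    return {x: Fr(T[(x, menu)], total) if x in tied else Fr(0) for x in menu}


menu = frozenset({"b", "c"})
u = {"b": 1, "c": 1}  # u(b) = u(c); the common level is an illustrative assumption
T = {("b", menu): 1, ("c", menu): 2}
T_scaled = {key: 10 * weight for key, weight in T.items()}  # same ratios, different scale

probs = tie_break_probs(menu, u, T)
print(probs["b"], probs["c"])  # 1/3 2/3
```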
Theorem 2.33 tells us that in a complete data set we can uniquely identify the above-satisficing elements, the preference ordering among non-satisficing elements, the tie breaking rule when used and the probability that one satisficing element is seen before another in any choice set. Its proof follows from Theorem 2.14 once the identification of the tie breaking rule is established. The identification of the latter (the tie breaking rule) holds because, whenever it is used, it is calibrated from the empirical choice probability among elements that are not revealed to be satisficing.

2.4.3 Incomplete Datasets

Here we relax the complete data set assumption while keeping the full support distribution condition and assuming no indifference. We do not work with the fixed distribution assumption since Total Monotonicity is not well defined for incomplete data sets, and the literature on Random Utility Models has not dealt with this extension.

Notice that complete data is not a necessary assumption for Theorem 2.11. Thus, if we drop completeness, the implications of the model are not affected, but identification becomes weaker. To see this, note that $W^*$ may be a strict superset of $U^*$, since a below-satisficing alternative may be always chosen because it is only observed in choice sets containing below-satisficing alternatives. The accuracy with which we can identify the primitives of the model given observed data depends on the richness of the data set.

Theorem 2.34 Let $(\mathcal{D}, p)$ be a data set (that need not be complete) that satisfies Axiom 2.8 and Axiom 2.10. Then a FSSM $(u, u^*, \{\gamma_A\}_{A \in \mathcal{D}})$ represents the data if and only if

1. $\widetilde{W} \subseteq U^* \subseteq W^*$

2. $u$ represents the stochastic revealed preference relation on $X \setminus U^*$: that is, if $a$ is stochastically strictly revealed preferred to $b$ then $u(a) > u(b)$, and if $a$ is revealed preferred to $b$ then $u(a) \geq u(b)$, for any $a, b \in X \setminus U^*$

3. $\gamma_A(a \, r_A \, b \ \forall b \in (U^* \cap A) \setminus \{a\}) = p(a, A)$ for all $A \in \mathcal{D}$, $a \in U^* \cap A$,

where $\widetilde{W} \equiv \{a \in X \mid \exists A \in \mathcal{D} \text{ s.t. } p(a, A) \in (0, 1)\}$.

Theorem 2.34 tells us that, when dealing with incomplete data sets, one can only identify satisficing choices with certainty if these have been observed chosen when other satisficing alternatives were available as well; that is, if we see them being chosen stochastically. As we established before, the satisficing set is a subset of the set of always chosen alternatives, but with incomplete data it may be a strict subset. Furthermore, one can only partially recover the preference order for the alternatives revealed not to be satisficing. Finally, the search orders can only be identified up to the set of those that coincide with the one generated from the relative probabilities of the elements that surely are in $U^*$.

2.4.4 Random Utility and Random Threshold

Now we turn to the study of two variants of the satisficing model where we allow randomness in tastes and in the threshold rule. The first generalizes the constant threshold assumption and allows the threshold to be random but common across all items and menus. The second variant lets utility, previously taken as fixed, also be random (i.e., random utility).

Random Threshold

Here we explore the satisficing model when we allow for a random threshold. We establish the somewhat surprising result that assuming a random threshold that is independent of the search process and common across items and menus adds no generality to the satisficing model. In fact, these variants are indistinguishable from the FSSM/FDSM with constant threshold.

The probability of any item being satisficing is given by the probability that the common random threshold (CRT) variable falls below the utility of that item. The threshold is common because it remains the same across menus and items. We assume throughout the remainder of the section that the random threshold is independent of the random search.
Definition 2.35 (FSSM with Common Random Threshold (FSSM-CRT)) A data set $(\mathcal{D}, p)$ has a Full Support Satisficing Model with Common Random Threshold (FSSM-CRT) representation if there exist an injective $u : X \to \mathbb{R}$, a continuous random variable $u^* \sim F_{u^*}(\cdot)$ such that $\tau(a) = \Pr(u(a) > u^*) = F_{u^*}(u(a))$, with $\mathrm{supp}(u^*) \subseteq \mathbb{R}$ such that $|U^{*,CRT}| \geq 2$ with $U^{*,CRT} = \{x \in X : \tau(x) > 0\}$, and $\{\gamma_A\}_{A \in \mathcal{D}}$ with the Full Support property, such that, for any $A \in \mathcal{D}$ and $a \in A$:

\[ p(a, A) = \tau(a) \sum_{r_A \in \mathcal{R}_A} \prod_{b \in A : \, b \, r_A \, a} (1 - \tau(b)) \, \gamma_A(r_A) + \prod_{c \in A} (1 - \tau(c)) \, \mathbf{1}\big(u(a) > u(b) \ \forall b \in A \setminus \{a\}\big). \tag{2.4} \]

Notice that the term $\tau(a) \prod_{b \in A : \, b \, r_A \, a} (1 - \tau(b))$ measures the probability of item $a \in A$ being satisficing given that all items searched before it in menu $A$ under the fixed search order $r_A$ were not satisficing. Similarly, $\prod_{c \in A} (1 - \tau(c))$ is the probability of not finding anything satisficing in $A$,¹⁰ and $\mathbf{1}(u(a) > u(b) \ \forall b \in A \setminus \{a\}) = 1$ when $a$ is maximal under $u$ in $A$ and zero otherwise. We also adopt the convention that the product over the empty set is 1, $\prod_{\emptyset} (1 - \tau(b)) = 1$.

¹⁰ Notice that the summation of the first term of the expression over all items is $P(A) = \sum_{c \in A} \tau(c) \sum_{r_A \in \mathcal{R}_A} \prod_{b \in A : \, b \, r_A \, c} (1 - \tau(b)) \, \gamma_A(r_A) = 1 - \prod_{c \in A} (1 - \tau(c))$; thus $1 - P(A) = \prod_{c \in A} (1 - \tau(c))$ is the residual probability.

We first study the implications of the CRT assumption together with the Full Support assumption on the search process.

Lemma 2.36 A complete stochastic choice dataset $(\mathcal{D}, p)$ that can be generated by a FSSM-CRT satisfies Deterministic no satisficing choice (Axiom 2.8) and SARP (Axiom 2.10).

The previous lemma allows us to establish the following equivalence result.

Theorem 2.37 The following statements are equivalent:

1. A complete stochastic choice dataset $(\mathcal{D}, p)$ can be generated by a FSSM.

2. A complete stochastic choice dataset $(\mathcal{D}, p)$ can be generated by a FSSM-CRT.

Now we explore the CRT assumption combined with the fixed distribution assumption over the random search.
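Equation (2.4) and the adding-up property in footnote 10 can be checked numerically. The sketch below evaluates the formula exactly as stated on a single three-item menu; the values chosen for $u$, $\tau$ and $\gamma_A$ are hypothetical and serve only to illustrate the computation, and the function name `crt_probs` is not from the text.

```python
from fractions import Fraction as Fr
from itertools import permutations


def crt_probs(menu, u, tau, gamma):
    """Evaluate equation (2.4) on one menu."""
    nothing_satisficing = Fr(1)
    for c in menu:
        nothing_satisficing *= 1 - tau[c]
    best = max(menu, key=u.get)
    probs = {}
    for a in menu:
        search_term = Fr(0)
        for order, weight in gamma.items():
            miss_before = Fr(1)  # the empty product equals 1
            for b in order[: order.index(a)]:  # items searched before a
                miss_before *= 1 - tau[b]
            search_term += weight * miss_before
        # If nothing is satisficing, the utility-maximal item is chosen.
        fallback = nothing_satisficing if a == best else Fr(0)
        probs[a] = tau[a] * search_term + fallback
    return probs


A = ("a", "b", "c")
u = {"a": 3, "b": 2, "c": 1}                        # hypothetical utilities
tau = {"a": Fr(3, 4), "b": Fr(1, 2), "c": Fr(0)}    # hypothetical satisficing probabilities
gamma = {order: Fr(1, 6) for order in permutations(A)}  # uniform search, for illustration

p = crt_probs(A, u, tau, gamma)
print(sum(p.values()))  # 1
```

With these inputs the item with $\tau(c) = 0$ is never chosen, in line with the role of $U^{*,CRT}$ in the definition, and the probabilities add up to one as footnote 10 implies.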
Definition 2.38 (FDSM with Common Random Threshold (FDSM-CRT)) A data set $(\mathcal{D}, p)$ has a Fixed Distribution Satisficing Model with Common Random Threshold (FDSM-CRT) representation if it has a FSSM-CRT representation in which the family of stochastic search orders $\{\gamma_A\}_{A \in \mathcal{D}}$ satisfies the Fixed Distribution property.

In what follows, we will prove that the FDSM-CRT is behaviorally indistinguishable from the FDSM. But first we need two technical lemmata.

Lemma 2.39 A weighted sum of totally monotonic mappings $\hat{p}_i : X \times \mathcal{D} \mapsto [0, 1]$ for $i \in \{1, \cdots, I\}$, with $I \geq 1$ an integer, evaluated at some $(a, A)$ such that $a \in A$ and $A \in \mathcal{D}$, $\hat{P}(a, A, \omega) = \sum_{i=1}^{I} \omega_i \hat{p}_i(a, A)$ with $\omega_i \geq 0$ a non-negative weight, also satisfies total monotonicity.

Lemma 2.40 The mapping $\hat{P} : X \times \mathcal{D} \mapsto [0, 1]$ defined by $(a, A) \mapsto T(A) \, \mathbf{1}(u(a) > u(b) \ \forall b \in A \setminus \{a\})$ for a fixed $a \in A$ is totally monotone, where $T(A) = \prod_{c \in A} (1 - \tau(c))$.

With this in hand we are ready to state and prove the empirical implications of FDSM-CRT.

Lemma 2.41 A complete stochastic choice dataset $(\mathcal{D}, p)$ that can be generated by a FDSM-CRT satisfies Deterministic no satisficing choice (Axiom 2.8), SARP (Axiom 2.10) and Total Monotonicity (Axiom 2.16).

The previous lemma allows us to establish the empirical equivalence between FDSM with and without CRT.

Theorem 2.42 The following statements are equivalent:

1. A complete stochastic choice dataset $(\mathcal{D}, p)$ can be generated by a FDSM.

2. A complete stochastic choice dataset $(\mathcal{D}, p)$ can be generated by a FDSM-CRT.

We conclude that adding the CRT does not add generality to the FSSM and FDSM, as it does not change the empirical implications of these satisficing models.

Random Utility FDSM

Here we focus on a variant of the FDSM where we also allow for random utility.¹¹ We consider a model where there is a probability measure $\rho : \mathcal{U} \mapsto [0, 1]$ defined over a domain $\mathcal{U}$ of injective utilities on $X$.
For a given element in the support $u \in \mathcal{U}$, we assume that the DM is consistent with an FDSM, denoted $p_u^{FDSM}$, with primitives $(u, u^*, \{\gamma_A\}_{A \in \mathcal{D}})$ (i.e. $u^*$ and $\{\gamma_A\}_{A \in \mathcal{D}}$ are common for each $u \in \mathcal{U}$):

¹¹ We omit the variant of FSSM with random utility as it imposes too few restrictions. In fact, the reader can easily check using the results in this section that the necessary and sufficient condition that characterizes FSSM with random utility is the Degeneracy condition (Axiom 2.44).

Definition 2.43 (Random Utility FDSM representation (RU-FDSM)) A data set $(\mathcal{D}, p)$ has a Random Utility FDSM representation (RU-FDSM) if it is the average of the FDSM models over all possible utilities in $(\rho, \mathcal{U})$:

\[ p(a, A) = \sum_{u \in \mathcal{U}} \rho(u) \, p_u^{FDSM}(a, A). \]

We note that the new model satisfies neither SARP (Axiom 2.10) nor Deterministic non satisficing choice (Axiom 2.8). It satisfies, however:

Axiom 2.44 (Degeneracy) For $x \in X \setminus W^*$ and $A \cap W^* \neq \emptyset$, $p(x, A) = 0$.

Notice also the following direct results for RU-FDSM:

Claim 2.45 If the data set $(\mathcal{D}, p)$ has a RU-FDSM representation, then $W^* = \{a \in X : p(a, A) > 0, \ \forall A \in \mathcal{D}\} \equiv \{a \in X : \exists u \in \mathcal{U}, \ \rho(u) > 0, \ u(a) > u^*\}$ corresponds to the set of "sometimes" satisficing items. Also, $X \setminus W^*$ corresponds to the set of never satisficing items.

Now we consider the following property, under which any item that is satisficing for some fixed utility is satisficing for all utilities in the domain of the random utility $(\rho, \mathcal{U})$. This property is closer to the FDSM as it restricts the taste variation for satisficing items.

Definition 2.46 (Always satisficing random utility distribution) A random utility distribution is always satisficing if for any item $x \in X$ such that $u(x) > u^*$ for some $u \in \mathcal{U}$ with $\rho(u) > 0$, it also holds that $\hat{u}(x) > u^*$ for any other $\hat{u} \in \mathcal{U}$ with $\rho(\hat{u}) > 0$.

It is clear that this property allows for a sharper interpretation of $W^*$. In fact, consider the following claim.
Claim 2.47 If a complete stochastic choice dataset $(\mathcal{D}, p)$ can be generated by a RU-FDSM with an Always satisficing random utility distribution, then if $a \in W^*$ it follows that $\Pr(u : u(a) > u^*) = 1$, and $\Pr(u : u(b) > u^*) = 0$ for $b \in X \setminus W^*$. Thus $W^* = \{a \in X : u(a) > u^* \ \forall u \in \mathcal{U}, \ \rho(u) > 0\}$ is the set of always satisficing items.

We are ready to characterize the RU-FDSM, and we notice already that it is observationally equivalent to the RU-FDSM with an Always satisficing random utility distribution.

Theorem 2.48 The following are equivalent:

1. A complete stochastic choice dataset $(\mathcal{D}, p)$ can be generated by a RU-FDSM.

2. A complete stochastic choice dataset $(\mathcal{D}, p)$ satisfies Axiom 2.16 and Degeneracy (Axiom 2.44).

3. A complete stochastic choice dataset $(\mathcal{D}, p)$ can be generated by a RU-FDSM with an Always satisficing random utility distribution.

Identification of the Utility of Satisficing Items with Random Utility and Random Threshold.

A less satisfactory feature of the FSSM and the FDSM is that we cannot recover information about the utility of satisficing elements. It is an interesting question whether this characteristic of the FSSM/FDSM is inherent to general satisficing procedures or not. In this section we use the partial exploration of the random threshold/random utility variants of the satisficing model to contribute to answering this question. In general, we find that inferring information about the intensity of the utility of satisficing items is still problematic.

For the case of a common random threshold, Theorem 2.42 says that in a complete standard stochastic choice dataset $(\mathcal{D}, p)$ the FDSM and FDSM-CRT are indistinguishable from one another. In fact, this gives as a corollary that the random threshold distribution $u^* \sim F_{u^*}$ is not identified.
We can only separate the elements that have a positive probability of being over the threshold from those that have a zero probability of surpassing it (this follows from the fact that $W^* \equiv U^*$ in the case of the FDSM and $W^* \equiv \{a \in X : \tau(a) > 0\}$ in the FDSM-CRT). This partially identifies the support of $F_{u^*}$ (up to monotone transformations of the random variable) but not the actual distribution. The following corollary follows trivially from Theorem 2.42.

Corollary 2.49 If a complete stochastic choice dataset $(\mathcal{D}, p)$ can be generated by a FDSM-CRT, then the utility of the satisficing elements, $U^{*,CRT} = \{a \in X : \tau(a) > 0\}$, is not identified.

From Theorem 2.42 it is also clear that only by restricting the search procedure captured by $\{\gamma_A\}_{A \in \mathcal{D}}$ beyond the fixed distribution assumption can we hope to obtain more information about the random variable $u^*$ and about the utility level of satisficing items. In fact, this seems to be true for a uniform search restriction (and the implications of such a model are left as an open question).

The RU-FDSM clearly contains the FDSM-CRT/FDSM and, by Theorem 2.48, is not contained by the latter. In consequence, we expect no gains from allowing random utility in identifying the utility of satisficing items. However, the RU-FDSM is interesting because one could argue that the fact that some items fall below the fixed threshold $u^*$ randomly could help to identify some form of average utility, like the quantity $\Pr(u : u(a) > u^*) = \sum_{u \in \mathcal{U}} \rho(u) \mathbf{1}(u(a) > u^*)$. However, we have the following corollary of Theorem 2.48, which establishes that $\Pr(u : u(a) > u^*)$ is not identified.
Corollary 2.50 If a complete stochastic choice dataset $(\mathcal{D}, p)$ can be generated by a RU-FDSM (i.e., a list $(\rho, \mathcal{U}, p_u^{FDSM})$) with $\Pr(u \in \mathcal{U} : u(a) > u^*) = \sum_{u \in \mathcal{U}} \rho(u) \mathbf{1}(u(a) > u^*) > 0$ for $a \in W^*$ and zero otherwise, then there is an alternative RU-FDSM representation of the dataset $(\mathcal{D}, p)$ with the Always satisficing property (i.e., a list $(\rho^*, \mathcal{U}^*, p_u^{*,FDSM})$) such that $\Pr(u \in \mathcal{U}^* : u(a) > u^*) = 1$ if $a \in W^*$ and zero otherwise.

Recall that the FDSM allows us to recover $\mathbf{1}(u(a) > u^*)$ only, which implies the lack of identification of utility (or utility intensity for satisficing items). Thus the RU-FDSM fares no better than the FDSM in identifying the utility levels of satisficing items, because under the Always satisficing property $\Pr(u \in \mathcal{U}^* : u(a) > u^*)$ is numerically equivalent to $\mathbf{1}(u(a) > u^*)$: if $a \in W^*$ then $\mathbf{1}(u(a) > u^*) = 1$, and it is zero otherwise.

2.5 Relation to Existing Literature

The paper closest in spirit to ours is (Manzini and Mariotti 2014), henceforth MM, which characterizes the random choice generated by a DM who makes choices by optimizing on a stochastically generated consideration set. As in our model, preferences are deterministic, with randomness in choice coming from stochastic changes in attention. However, the behavioral implications of the two models are quite different, with the satisficing model being the more general. In the set up of MM, all alternatives are always chosen with positive probability in each set. In such a data set, Axiom 2.8 and Axiom 2.10 are always satisfied, and so the FSSM is trivially more general than the stochastic consideration set model. Moreover, the FDSM also nests the stochastic consideration set model. This follows from the fact that, when restricted to the class of data in which all alternatives are chosen with positive probability, the FDSM is equivalent to the class of all RUMs, and the model of MM is a strict subset of this class.
Moreover, the FSSM and FDSM can accommodate data sets in which not all alternatives are chosen with positive probability.

Our work also contributes to the literature aimed at testing the satisficing model. It is well known that standard deterministic choice data cannot be used to distinguish rational choice from satisficing behavior, implying that richer data is needed. (Caplin, Dean, and Martin 2011, Caplin and Dean 2011) showed how to test the satisficing model using 'choice process' data, which records not just the final choice made by a decision maker, but also how choices change with contemplation time. (Santos, Hortacsu, and Wildenbeest 2012) utilize data in which the sequence of search is recorded to test the satisficing model. This chapter describes the implication of the satisficing model for stochastic choice data, which is arguably easier to collect than either choice process or search data.

Another relevant paper is (Rubinstein and Salant 2006), which studies the implications of choices from lists. These include as a special case a variant of the satisficing model where the DM chooses the first element in a list (which can be put in one to one correspondence with a linear search order) that is above a threshold and, if there is none, chooses the last one. We differ from this effort in that we assume we do not observe the lists or linear orderings; instead we infer the distribution of the linear search from the frequency of choice.

Ours is not the first paper to characterize the behavior of random choice rules. Much of the previous work has focused on random utility models (RUMs), in which the DM chooses in order to maximize a utility function drawn from some distribution (see for example (Block and Marschak 1960, Falmagne 1978, Gul, Natenzon, and Pesendorfer 2014)). As discussed above, the FSSM is behaviorally distinct from the class of RUMs.
It is easy to construct examples of FSSMs which violate regularity, and so cannot be modeled as resulting from random utility maximization. Moreover, RUMs are not guaranteed to satisfy either Axiom 2.8 or Axiom 2.10. In contrast, the FDSM is behaviorally a subset of the class of RUMs. Total Monotonicity (Axiom 2.16) is necessary and sufficient for a RUM representation (Falmagne 1978).

2.A Proofs

2.A.1 Proof of Proposition 2.4

Proof. For any data set $(\mathcal{D}, p)$, set $u : X \to [0, 1]$ to be any arbitrary one to one real valued function and set $u^* = -2$; then $U^* = X$. For any menu $A \in \mathcal{D}$, let $r_A^a$ be the set of linear orders on $A$ such that $a \, r \, b$ for all $b \in A$. The family $\{r_A^a\}_{a \in A}$ therefore defines a partition of $\mathcal{R}_A$. Define

\[ \gamma_A(r_A) = \frac{p(a, A)}{|r_A^a|} \quad \text{where } r_A \in r_A^a. \]

Such a representation will generate $p$ since, for any $a$, $u(a) \geq u^*$ and

\[ \gamma_A\big(r_A \mid a \, r_A \, b \ \forall b \text{ s.t. } u(b) \geq u^*\big) = \gamma_A\big(r_A \in r_A^a\big) = p(a, A). \]

2.A.2 Proof of Lemma 2.7

Proof. ($W^* \subseteq U^*$) Let $a \in W^*$; then for any given $A \in \mathcal{D}$ we have $p(a, A) > 0$, so either (i) $a \in U^*$ or (exclusive) (ii) $u(a) > u(b)$ for all $b \in A$ and $U^* \cap A = \emptyset$. Assume $a \notin U^*$; then (since $U^* \neq \emptyset$) there exists $b \in U^*$ and, by completeness of the data, there is a menu $A' \in \mathcal{D}$ such that $a, b \in A'$, and therefore $p(a, A') = 0$, so we have a contradiction.

($U^* \subseteq W^*$) Let $a \in U^*$, so that $u(a) > u^*$. By FSSM, $p(a, A) = \gamma_A(r_A \mid a \, r_A \, b \ \forall b \in A \text{ s.t. } u(b) > u^*) > 0$, where the last inequality follows from the Full Support assumption.

2.A.3 Proof of Theorem 2.11

Proof. First we prove that (1) implies (2). To prove that a data set $(\mathcal{D}, p)$ that admits a FSSM representation satisfies Axiom 2.8, first notice that $U^* \subseteq W^*$ even for incomplete data sets. To see this, note that if $a \in U^*$ then $u(a) \geq u^*$ and, since the data has a FSSM representation, $p(a, A) = \gamma_A(r_A \mid a \, r_A \, b \ \forall b \in A \cap U^* \setminus \{a\})$ for all $A \in \mathcal{D}$. Given the full support assumption, $\gamma_A(r_A \mid a \, r_A \, b \ \forall b \in A \cap U^* \setminus \{a\}) > 0$ for all $A \in \mathcal{D}$; therefore $p(a, A) > 0$ for all $A \in \mathcal{D}$, which in turn implies that $a \in W^*$.
Then if $a \notin W^*$ we have that $a \notin U^*$, which in turn implies that $u(a) < u^*$. Since $u$ is injective, there exists a unique $a_A^* = \arg\max_{a \in A} u(a)$. Then either (i) $a = \arg\max_{b \in A} u(b)$ or (ii) $u(a) < \max_{b \in A} u(b)$. If (i), since the data has a FSSM representation, $p(a, A) = 1$; while if (ii), $p(a, A) = 0$. In either case Axiom 2.8 follows.

To show that Axiom 2.10 holds, assume, by way of contradiction, that (i) $a$ is stochastically revealed preferred to $b$ and (ii) $b$ is stochastically strictly revealed preferred to $a$. From (ii), given that the data admits a FSSM representation, we must have (by the full support assumption) that $u(a) < u^*$ and $u(a) < u(b)$. If $a$ is stochastically revealed preferred to $b$ then there must exist a sequence of alternatives $c_1, \ldots, c_N$ and choice sets $A_1, \ldots, A_{N-1}$ such that $c_1 = a$, $c_N = b$, $c_n, c_{n+1} \in A_n$ and $c_n \in C(A_n)$. If $u(b) > u^*$ then it must be the case that $u(c_{N-1}) > u^*$ (otherwise $c_{N-1}$ could not be chosen with positive probability when $c_N$ was available). Iterating on this argument implies that $u(a) > u^*$. If $u(b) < u^*$ then this implies that $u(c_{N-1}) > u(b)$. If $u(c_n) < u^*$ for all $n$ then iterating on this argument implies that $u(a) > u(b)$. Otherwise, the previous argument implies that $u(a) > u^*$. Either provides a contradiction.

Now we prove that (2) implies (1). For $A \subseteq X \setminus W^*$, $C(A) \equiv \{a \in A \mid p(a, A) = 1\}$ from Axiom 2.8. Given Axiom 2.10, we can generate an injective utility function using Afriat/Richter's theorem, such that $u : X \setminus W^* \to [0, \bar{u}]$ and $C(A) = \arg\max_{a \in A} u(a)$ for all $A \in \mathcal{D}$ such that $A \cap W^* = \emptyset$. Fix $u^* > \bar{u}$, enumerate the elements in $W^*$ as $a_1, \ldots, a_N$ and let $u(a_n) = u^* + n$ for all $a_n \in W^*$. Thus $u : X \to \mathbb{R}$ is injective and $U^* \equiv W^*$. Furthermore, notice that Axiom 2.10 implies that $W^*$ is non-empty, so that $u^* \leq \max_{x \in X} u(x)$.
For every A such that A ∩ W* is non-empty, let A* ≡ A ∩ W*, define R_{A*} as the set of all linear orders on A*, and let R_A(r_{A*}) be the set of all linear orders r_A ∈ R_A that induce the linear order r_{A*} ∈ R_{A*}. Then set the probability of the set of linear orders that generate each r_{A*} as

\[ \gamma_A\left(r_A \in R_A(r_{A^*}) \mid a_i \, r_{A^*} \, a_j \ \ \forall a_j \in A^* \setminus \{a_i\}\right) = \frac{p(a_i, A)}{\left|\{ r_{A^*} \in R_{A^*} : a_i \, r_{A^*} \, a_j \ \ \forall a_j \in A^* \setminus \{a_i\} \}\right|} \]

for all a_i ∈ A*. Finally, distribute the probability mass above uniformly across the elements in R_A(r_{A*}). For any r_A ∈ R_A(r_{A*}), for a given r_{A*}:

\[ \gamma_A(r_A) = \frac{\gamma_A\left(r_A \in R_A \mid a_i \, r_{A^*} \, a_j \ \ \forall a_j \in A^* \setminus \{a_i\}\right)}{|R_A(r_{A^*})|} \]

where | · | stands for the cardinality map. Note that, as r_A ∈ R_A(r_{A*}) with a_i r_{A*} a_j ∀a_j ∈ A* \ {a_i} for some a_i ∈ A*, and as p(a_i, A) > 0 by construction of W*, this distribution will have full support on R_A. If A ∩ W* = ∅ then for all r_A ∈ R_A define

\[ \gamma_A(r_A) = \frac{1}{|R_A|}. \]

Thus, between them, u, u* and {γ_A}_{A∈D} satisfy the requirements of an FSSM. To verify that we can generate (D, p), notice that if we face a menu A we have the following cases:

(i) If A ∩ W* = A then u(a) > u* for all a ∈ A and, for each a, p(a, A) = γ_A(r_A ∈ R_A | a r_A b ∀b ∈ A \ {a}). To see that this is true observe that

\[ \gamma_A\left(r_A \in R_A \mid a \, r_A \, b \ \ \forall b \in A \setminus \{a\}\right) = \sum_{r_A \in R_A} \mathbf{1}\left(r_A : r_{A^*} \text{ such that } a \, r_{A^*} \, a_j \ \ \forall a_j \in A^*\right) \gamma_A(r_A) = p(a, A). \]

(ii) If A ∩ W* = ∅ then p(a, A) = 1 if u(a) > u(b) for all b ∈ A \ {a} and zero otherwise. This follows directly from the fact that u was constructed to represent choice on such sets.

(iii) If A ∩ W* ⊂ A and A ∩ W* ≠ ∅ then we have p(a, A) = γ_A(r_A | a r_A b ∀b ∈ (A ∩ W*) \ {a}) if u(a) ≥ u* and p(a, A) = 0 if u(a) < u*. To see that this is true, observe that by definition of W* and Axiom 2.8, p(A ∩ W*, A) = 1. Indeed, if we assume that p(A ∩ W*, A) < 1 we must have p(a, A) > 0 for some a ∉ W*, but by Axiom 2.8 that means p(a, A) = 1, which contradicts the fact that p(A ∩ W*, A) > 0. Then p(a, A) = 0 if u(a) < u*.
For a, b ∈ A ∩ W*, the result follows as in (i).

2.A.4 Proof of Theorem 2.14

Proof. To prove (1) assume, by contradiction, that U* ≠ Ū* and let v ∈ Ū* \ U*. Then it must be the case that u(v) < u* and ū(v) ≥ ū*. By completeness and the fact that U* ≠ ∅, there exist a ∈ U* and a menu A ∈ D such that A = {a, v}. Given that the data is represented by (u, u*, {γ_A}_{A∈D}), p(a, A) = 1. On the other hand, since (ū, ū*, {γ̄_A}_{A∈D}) also represents the data, it is the case that p(a, A) ∈ (0, 1), which establishes a contradiction.

To prove (2) notice that from (1) U* = Ū*. Since (u, u*, {γ_A}_{A∈D}) and (ū, ū*, {γ̄_A}_{A∈D}) represent the same data, u and ū represent the preferences given by Definition 2.9 for all a ∉ U*. Therefore, it must be the case that ū is a strictly increasing transformation of u on X \ U*.

To prove (3) assume, by contradiction, that for some A ∈ D and a ∈ A ∩ U*,

\[ \gamma_A\left(a \, r_A \, b \ \ \forall b \in \{A \cap U^*\} \setminus \{a\}\right) \ne \bar{\gamma}_A\left(a \, r_A \, b \ \ \forall b \in \{A \cap U^*\} \setminus \{a\}\right); \]

then p(a, A | γ_A) ≠ p(a, A | γ̄_A), which in turn implies that (u, u*, {γ_A}_{A∈D}) and (ū, ū*, {γ̄_A}_{A∈D}) cannot both represent the same data.

2.A.5 Proof of Theorem 2.17

Proof. First we prove (1) implies (2). If the complete data is generated by a FDSM then there is a triple (u, u*, Γ_X). Take a realization of Γ_X with support on R_X and call it r_X, then define the linear ordering ≻_X on X by: (1) a ≻_X b if a r_X b and a, b ∈ U*, (2) a ≻_X b if u(a) > u(b) and a, b ∉ U*, and (3) a ≻_X b if a ∈ U*, b ∉ U*. Now assign this linear ordering ≻_X the probability Γ_X(r_X). It is direct to see that there is a Random Utility Maximization model without indifference with realizations ≻_X with probability Γ_X(r_X). By Block and Marschak (1960) it follows that the generated data set (D, p) satisfies Total Monotonicity (Axiom 2.16). The fact that Axiom 2.8 and Axiom 2.10 hold follows from Theorem 2.11.

Second we prove (2) implies (1).
Because FDSM is a subcase of FSSM and Axiom 2.8 and Axiom 2.10 hold, we can build a utility u : X ↦ R and a threshold u* ∈ R such that u(x) ≥ u* for all x ∈ W* and u(x) < u* for all x ∈ X \ W*. Finally, if a complete stochastic choice dataset (D, p) satisfies Axiom 2.16 then the model is a Random Utility Maximization model, so we can recover a distribution over linear orders on X, Γ^Q_X : R^Q_X ↦ [0, 1], such that p(a, A) = Γ^Q_X(a r^Q_X b ∀b ∈ A), thanks to Falmagne (1978). Call its support R^Q_X the set of quasi-search orderings, because they are the result of both deterministic (degenerate) utility maximization choice and fixed random searching and satisficing behavior.

To construct Γ_X we build each element of its support by taking an element r^Q_X ∈ R^Q_X of the support of the quasi-search orderings and restricting it to W*. Here we define an equivalence relation on R^Q_X: if any two elements r^Q_X, r'^Q_X ∈ R^Q_X have the same restriction to W* (i.e., x r^Q_X y ⟺ x r'^Q_X y for x, y ∈ W*) we say r^Q_X ≡_{W*} r'^Q_X. The equivalence class is denoted [r^Q_X]_{≡W*} = {r'^Q_X ∈ R^Q_X | r'^Q_X ≡_{W*} r^Q_X}. We then assign to the representative of the equivalence class, that is, the restriction r^Q_X|W*, the probability given by the sum

\[ \sum_{r' \in [r^Q_X]_{\equiv_{W^*}}} \Gamma^Q_X(r'). \]

For any given restricted ordering r^Q_X|W* we build the set of its transitive extensions to X and call this set R_X(r^Q_X) ⊂ X × X. We assign each element r̂_X ∈ R_X(r^Q_X) the probability

\[ \frac{\sum_{r' \in [r^Q_X]_{\equiv_{W^*}}} \Gamma^Q_X(r')}{|R_X(r^Q_X)|}, \]

where the numerator is the probability of the quasi-search ordering restricted to W*, r^Q_X|W*, and the denominator is the cardinality of the previously defined set. Doing this for all elements of R^Q_X we build a new support R_X with probabilities as indicated, which provide us with Γ_X.

Note that Γ_X has full support due to how W* is constructed.
Because W* is the always chosen set, we know that any element of the set of restrictions r^Q_X|W* such that, for each a ∈ W*, a r^Q_X b for all b ∈ W* \ {a} has positive probability. It follows that, by definition, for any A ∈ D and for any x ∈ W* we have p(x, A) > 0; this means that Γ^Q_X(r^Q_X ∈ R^Q_X : x r^Q_X y ∀y ∈ X \ {x}) > 0. This implies that all representatives of the equivalence class of restricted orderings r^Q_X|W* where x is searched first in W* have positive probability. Then we have extended them to X with the uniform distribution for each set R_X(r^Q_X), thus preserving the full support for the whole X. The reason is that the set of transitive extensions to X of any r^Q_X|W* contains linear search orders that have each x ∈ W* as the first searched element, because it contains the ordering that keeps the elements in W* at the top and the rest at the bottom. But it also contains search orders with each element of X \ W* at the top and W* at the bottom, all with positive probability.

We have extended each restriction r^Q_X|W* such that we can let Γ_X be the fixed distribution of search orders. We have built a FDSM, that is, a triple (u, u*, Γ_X). To verify that this FDSM generates (D, p), notice that if we face a menu A we have the following cases:

(i) If A ∩ W* = A then p(a, A) = Γ_X(r_X ∈ R_X | a r_X b ∀b ∈ A \ {a}). To see that this is true observe that Γ_X(r_X ∈ R_X | a r_X b ∀b ∈ A \ {a}) = Γ^Q_X(r^Q_X ∈ R^Q_X | a r^Q_X b ∀b ∈ A \ {a}). In this case the first equality follows from the equivalence to random utility when restricted to W*.

(ii) If A ∩ W* = ∅ then p(a, A) = 1 if u(a) > u(b) for all b ∈ A \ {a} and zero otherwise. This is direct from the fact that u represents C in such sets.

(iii) If A ∩ W* ⊂ A then we have p(a, A) = Γ_X(r_X | a r_X b ∀b ∈ (A ∩ U*) \ {a}) if u(a) ≥ u* and p(a, A) = 0 if u(a) < u*. To see that this is true observe that by definition of W* and Axiom 2.8, p(A ∩ W*, A) = 1; then p(a, A) = 0 if u(a) < u*.
For a, b ∈ A ∩ W* observe that by construction Γ_X(r_X ∈ R_X | a r_X b ∀b ∈ (A ∩ U*) \ {a}) = Γ^Q_X(r^Q_X ∈ R^Q_X | a r^Q_X b ∀b ∈ (A ∩ W*) \ {a}). Finally, observe that Γ^Q_X(r^Q_X ∈ R^Q_X | a r^Q_X b ∀b ∈ (A ∩ W*) \ {a}) = Γ^Q_X(r^Q_X ∈ R^Q_X | a r^Q_X b ∀b ∈ A \ {a}) = p(a, A), because the facts that p(c, A) = 0 for any c ∉ W* and that Γ^Q_X represents choice imply that Γ^Q_X(r^Q_X ∈ R^Q_X | c r^Q_X a) = 0.

2.A.6 Proof of Theorem 2.18

Proof. For (1) and (2) we use the results of Theorem 2.14. (3) follows from the fact that W* ≡ U* and the FDSM behaves as Random Utility for the elements in W*. Assume by contradiction that Γ_X(x r_X y) ≠ Γ̄_X(x r_X y) for some x, y ∈ W*; but that means that in the menu {x, y}, p(x, {x, y} | Γ_X) ≠ p(x, {x, y} | Γ̄_X), which is a contradiction.

2.A.7 Proof of Claim 2.19

Proof. To prove (1) note first that it is trivial that W* ⊆ W̃*. Moreover, a difference in the utility threshold level only affects decisions if {a ∈ X : u(a) ∈ [ũ*, u*)} ∩ A ≠ ∅ for some A ∈ D; otherwise W* = W̃*. Assume now that Ã ≡ {a ∈ X : u(a) ∈ [ũ*, u*)} ≠ ∅. By the definition of the FSSM (FDSM), it is the case that Ã ⊂ W̃*, and we get the desired result.

To prove (2) note that the DMs are identical up to u*, ũ*; then if W* ≠ W̃*, it must be the case that u* ≠ ũ*. Assume by way of contradiction that u* ≤ ũ*; then ũ* > u*. Then, by the same argument as above, either W* = W̃* or {a ∈ X : u(a) ∈ [u*, ũ*)} ≠ ∅ and then W* \ W̃* ≠ ∅, which induces a contradiction.

2.A.8 Proof of Proposition 2.23

Proof. First we prove that if u* ≥ ũ* then F^I(u, A) ≤ F^Ĩ(u, A) for all u. For any menu A ∈ D, a change in the utility threshold level only affects decisions if there exists a ∈ A such that u(a) ∈ [ũ*, u*); otherwise F^I(·, A) = F^Ĩ(·, A) and the result is trivially true. Now consider the case where there exists at least one a ∈ A such that u(a) ∈ [ũ*, u*). We have two cases, (i) A ∩ U* = ∅ or (ii) A ∩ U* ≠ ∅.
If (i) and | {u : u = u(a) ∈ [˜ u∗ , u∗ ) for some a ∈ A} | = 1 then F I (·, A) = F˜I (·, A) and we have the desired result. If | {u : u = u(a) ∈ [˜ u∗ , u∗ ) for some a ∈ A} | > 1 then the 118 result follows by noticing that  for u < u ˜∗    0   F˜I (u, A) = ˜I u∗ ≤u(a) 0 given full support assumption. On the other hand, P with u∗ ≤u(a) 0. Then P By the full support assumption ˜ ∗ \U ∗ )∩A p b∈(U  0 for u < u ˜∗       ˜  p I (a, A) ˜ ∗ ≤ u < u∗ P for u   u∗ ≤u(a)≤u a:˜ F˜I (u, A) = (1 − β)F I (u, A) ˜ ∗ ≤ u < u∗     for u    for u ≥ u∗  1 ˜ with β = p I (a, A). P u∗ ≤u(a) 0 ˜ ∗ ≤ u < u∗ P for u   u∗ ≤u(a)≤u a:˜ F˜I (u, A) − F I (u, A) = β F I (u, A) > 0 ˜ ∗ ≤ u < u∗     for u    for u ≥ u∗  0 then F˜I (u, A) ≥ F I (u, A) for all u. Now we prove that if F I (u, A) ≤ F˜I (u, A) for all u with F I (u, A) 6= F˜I (u, A) then u∗ ≥ u ˜∗ . First notice that if F I (u, A) 6= F˜I (u, A) then it must be the case that u∗ 6= u ˜∗ . Assume by the way of contradiction that u∗ ≤ u ˜∗ > u∗ . Following the same argument as ˜∗ , then u in the previous part we have that, if there exists a ∈ A such that u(a) ∈ [u∗ , u ˜∗ ) then F I (u, A) ≥ F˜I (u, A) for all u with F I (u, A) 6= F˜I (u, A) which leads to a contradiction. 2.A.9 Comparative Statics with Respect to Search Orders First, we introduced the following class of equivalent search order for a given A ∈ D. Definition 2.51 ({a, b}-equivalent search orders) Let γA, γ ˜A be two probability distri- butions over search orders RA for some A ∈ D. We say that γA, γ ˜A are {a, b}-equivalent search orders for menu A if γA(rA) = γ ˜A(rA) for all rA such that rA\{a} ∗ = ˜rA\{a} and rA\{b} ∗ = ˜rA\{b} for some a, b ∈ A. Proposition 2.52 (Salience of Satisficing Elements) Let I = (u, u∗ , {γA}A∈D ) and ˜I = (u, u∗ , {˜ γA}A∈D ) be two FSSM, with cumulative probability distribution over utility levels F I (u, A) and F˜I (u, A). 
Let γ_A and γ̃_A be {a, b}-equivalent search orders as in Definition 2.51 with γ_A(r*_A) = γ̃_A(r*_A) + ε and γ_A(r̃_A) = γ̃_A(r̃_A) − ε for some ε ∈ (γ̃(r̃_A), 1 − γ̃_A(r*_A)), and a r*_A b and b r̃_A a. If u(a) ≥ u(b) then F^I(·, A) FOSD F^Ĩ(·, A). Moreover, if F^I(u, A) ≤ F^Ĩ(u, A) with F^I(u, A) ≠ F^Ĩ(u, A) then u(a) > u(b) ≥ u*.

Proof. First we prove that if u(a) > u(b) then F^I(·, A) FOSD F^Ĩ(·, A). There are four possible cases: (i) If a, b ∉ U* then F^I(·, A) = F^Ĩ(·, A). (ii) If a ∈ U* and b ∉ U*, under the conditions of the proposition F^I(·, A) = F^Ĩ(·, A). (iii) If b ∈ U* and a ∉ U*, under the conditions of the proposition F^I(·, A) = F^Ĩ(·, A). (iv) Finally, consider the case where a, b ∈ U*. Under the conditions of the proposition p^I(a, A) > p^Ĩ(a, A) and p^I(b, A) < p^Ĩ(b, A), while p^I(c, A) = p^Ĩ(c, A) for all c ∈ A \ {a, b}, and the result follows from u(a) > u(b).

Now we prove that if F^I(u, A) ≤ F^Ĩ(u, A) for all u with F^I(u, A) ≠ F^Ĩ(u, A) then u(a) > u(b) ≥ u*. Notice that the change in the probabilities of search orders only affects the relative probability of a, b being seen first, leaving unaltered all other relative probabilities. Therefore, the change from {γ̃_A}_{A∈D} to {γ_A}_{A∈D} only has an effect on the distribution over utility levels if {a, b} ⊆ U*. Moreover, given the conditions of the proposition, p^I(a, A) > p^Ĩ(a, A) and p^I(b, A) < p^Ĩ(b, A). Then it must be that u(a) > u(b). To show the latter, assume by contradiction that u* ≤ u(a) < u(b); then F^Ĩ(u(a), A) < F^I(u(a), A), which leads to a contradiction.

Corollary 2.53 Under the conditions of Proposition 2.52, u(a) ≥ u(b) if and only if E_u^I(A) ≥ E_u^Ĩ(A).

2.A.10 Proof of Proposition 2.28

Proof. First we prove that (1) implies (2).
If a complete data set has a FDSM without full support representation then there is a triple (u, u*, Γ_X). Take a realization of Γ_X with support on some subset or all of R_X and call it r_X, then define the linear ordering ≻_X on X by: (1) a ≻_X b if a r_X b and a, b ∈ U*, (2) a ≻_X b if u(a) > u(b) and a, b ∉ U*, and (3) a ≻_X b if a ∈ U*, b ∉ U*. Now assign this linear ordering ≻_X the probability Γ_X(r_X). It is direct to see that there is a Random Utility Maximization model without indifference with realizations ≻_X with probability Γ_X(r_X). By Block and Marschak (1960) it follows that the generated data set (D, p) satisfies Total Monotonicity (Axiom 2.16).

Now we prove that (2) implies (1). If a complete stochastic choice dataset (D, p) satisfies Axiom 2.16 then the model is a Random Utility Maximization model, so we can recover a distribution over linear orders on X, Γ_X : R_X ↦ [0, 1], such that p(a, A) = Γ_X(a r_X b ∀b ∈ A), thanks to Falmagne (1978). Interpret Γ_X as a fixed distribution over search orders, and select any injective u : X → R and u* such that u(x) > u* for all x ∈ X. By construction this triple (u, u*, Γ_X) generates the observed data.

2.A.11 Proof of Theorem 2.32

Proof. First we prove that (1) implies (2). To show that Axiom 2.10 holds, assume, by way of contradiction, that (i) a is stochastically revealed preferred to b and (ii) b is stochastically strictly revealed preferred to a. From (ii), given that the data admits a FSSMI, we must have (by the full support assumption) that u(a) < u* and u(a) < u(b). If a is stochastically revealed preferred to b then there must exist a sequence of alternatives c_1, ..., c_N and choice sets A_1, ..., A_{N−1} such that c_1 = a, c_N = b, c_n, c_{n+1} ∈ A_n and c_n ∈ C(A_n). If u(b) > u* then it must be the case that u(c_{N−1}) > u* (otherwise c_{N−1} could not be chosen with positive probability when c_N was available). Iterating on this argument implies that u(a) > u*. If u(b) < u* then this implies that u(c_{N−1}) > u(b).
If u(c_n) < u* for all n then iterating on this argument implies that u(a) ≥ u(b). Otherwise, the previous argument implies that u(a) > u*. Either provides a contradiction.

Now we prove that (2) implies (1). We say two items are revealed stochastically indifferent if a is revealed stochastically preferred to b and b is revealed stochastically preferred to a; in that case we write a I* b. We modify this relation I* ⊆ X × X by removing its elements that have at least one item from the always chosen set W*, which is non-empty by SARP. Formally we define the relation I = {(a, b) ∈ X × X : (a, b) ∈ I*, a ∈ X \ W* or b ∈ X \ W*} ∪ D(W*), where D(W*) is the diagonal ordering in W* (i.e., it contains only the elements (a, a) ∈ W* × W*). I is an equivalence relation because it is reflexive, symmetric and transitive. Because Axiom 2.10 holds, I* is an equivalence relation, and I is still an equivalence relation because it only eliminates the indifference of the items in W* except for reflexivity, namely the elements (a, a) ∈ I* for a ∈ W*. By Axiom 2.10 and the definition of W*, no item in X \ W* is revealed indifferent to an item in W*. The relation I induces equivalence classes that we denote as [a].

We concentrate on the quotient set X^I = X/I; we define the canonical projection j : X ↦ X/I and its inverse mapping j^{-1} : X/I ↦ X. We let D^I ≡ {j(A)}_{A∈D} be the set indexed by D. In particular define p^I : X^I × D^I ↦ [0, 1] as

\[ p(a^I, A^I) = \sum_{a \in j^{-1}(a^I) \cap A} p(a, A) \]

for A ∈ D such that j(A) = A^I and a^I ∈ A^I; this mapping is well defined. If Axiom 2.10 holds it follows that the quotient dataset {p(a^I, A^I)}_{a^I∈X^I, A^I∈D^I} also satisfies SARP. Also observe that in the quotient dataset the always chosen set W^{I,*} = {a^I ∈ X^I : p(a^I, A^I) > 0 ∀A^I ∈ D^I} is such that j^{-1}(a^I) ∈ W* for all a^I ∈ W^{I,*}; this follows from the construction of I because the equivalence classes in W* are singletons.
Observe also that Axiom 2.8 holds in the quotient dataset {p(a^I, A^I)}_{a^I∈X^I, A^I∈D^I}, because if b^I ∈ X^I \ W^{I,*} then either p(b^I, A^I) = 0 or (exclusively) p(b^I, A^I) = 1. In fact, the set X^I is a finite choice set with elements associated with degenerate probabilities of choice, p([a] ∩ A, A) = Σ_{a∈[a]∩A} p(a, A) ∈ {0, 1} if [a] ⊆ X \ W*. To see this, assume that p([a] ∩ A, A) ∈ (0, 1) and [a] ⊆ X \ W*; this means that there is a third element c ∈ A \ [a] that is stochastically revealed preferred to all a ∈ [a] (i.e., p(c, A) > 0), and of course all elements a ∈ [a] are stochastically revealed preferred to c, but that means that a I c for all a ∈ [a], which means that c ∈ [a]; this is a contradiction. By Theorem 2.11 we conclude that the quotient dataset {p(a^I, A^I)}_{a^I∈X^I, A^I∈D^I} can be generated by a FSSM without indifference; thus we build a triple (ū, u*, {γ̄_{A^I}}_{A^I∈D^I}) that generates the quotient dataset.

With this in hand we build the FSSMI in the actual dataset {p(a, A)}_{a∈X, A∈D}.

(i) We build a utility function u : X ↦ R by the composition u = ū ∘ j, where j is the canonical projection defined above. By construction u(a) > u* for all a ∈ W* and u(a) = u(b) if b ∈ [a]. Moreover, u(b) < u* for all b ∈ X \ W*.

(ii) The search probabilities are defined over the quotient set; we build search probabilities for the actual set X. γ̄ defines a full support search distribution on each A^I ∈ D^I. Now we define γ by the following algorithm: for each menu A ∈ D, we obtain the menu A^I ≡ j(A) ∈ D^I in the quotient dataset, then take R_{A^I}, the support of γ̄_{A^I}; now for any element r_{A^I} ∈ R_{A^I} define the restriction r_{A^I}|W^{I,*}. Now build the set of linear search orders on A, R_A(r_{A^I}|W^{I,*}) = {r_A ∈ R_A : a, b ∈ W*, a r_A b ⟺ j(a) r_{A^I}|W^{I,*} j(b)}. We assign to each element r_A ∈ R_A(r_{A^I}|W^{I,*}) the probability

\[ \gamma_A(r_A) = \frac{\sum_{r'_{A^I} \in R_{A^I}} \bar{\gamma}_{A^I}(r'_{A^I}) \, \mathbf{1}\left(r'_{A^I}|_{W^{I,*}} = r_{A^I}|_{W^{I,*}}\right)}{\left|R_A(r_{A^I}|_{W^{I,*}})\right|}. \]
This construction provides us with {γ_A}_{A∈D}, which defines a full support random linear ordering on each A ∈ D.

(iii) To build the menu-dependent tie-breaking rules we calibrate them as follows: T(a, A) = p(a, A) if a ∈ X \ W* and a ∈ [a] such that p([a] ∩ A, A) = 1. If p([a] ∩ A, A) = 0 then we let T(a | A_{∼a}) = 1/|[a] ∩ A|. This guarantees a tie-breaking rule that is always positive and that adds up to 1 as required.

We have generated a tuple (u, u*, {γ_A}_{A∈D}, T), that is, a FSSMI representation that generates the complete dataset {p(a, A)}_{a∈X, A∈D}. To verify this claim, notice that it follows immediately from applying Theorem 2.11 to generate the quotient dataset, noticing that for non-satisficing elements we can generate the actual dataset using the calibrated tie-breaking rule directly, and observing that the elements in W^{I,*} have a one-to-one correspondence to the elements in W*.

2.A.12 Proof of Theorem 2.33

Proof. (1)-(3) follow from Theorem 2.18. To prove (4) notice that, given U* = Ū*, for all a ∉ U*, if T(a | A_{∼a}) ≠ T̄(a | A_{∼a}) then, from the definition of the model, p(a, A) ≠ p̄(a, A).

2.A.13 Proof of Theorem 2.34

Proof. To prove (1) notice that, since we do not allow for ties, p(a, A) ∈ (0, 1) only if a ∈ U*; then W̃ ⊆ U*. Moreover, if a ∈ U* then, given the full support assumption, p(a, A) > 0 for all A ∈ D; then a ∈ W* as in Theorem 2.11.

(2) follows from Axiom 2.10. Notice that SARP does not require complete data sets to guarantee the existence of a utility function that represents the revealed preference relation.

(3) follows from the definition of the model. Note that identification is only possible when surely revealed satisficing elements are available together in a menu. That is, we can identify the probability of a being seen before b in A ∈ D, i.e. γ_A(a r_A b ∀b ∈ (A ∩ U*) \ {a}), if a ∈ A ∩ U*. Moreover notice that if this is the case, then a, b ∈ W̃.

2.A.14 Proof of Lemma 2.39

Proof. In order to prove the total monotonicity condition for P̂(a, A, ω), we first need to define the following function for all i ∈ {1, ..., I}, for each A ∈ D and a ∈ A:

\[ f_i(a, A) = \sum_{D \in B(A)} (-1)^{|D \setminus A|} \, \hat{p}_i(a, D) \]

where B(A) is the class of supersets of A (i.e., B(A) ≡ {D ∈ D | A ⊆ D}). Now define

\[ f(a, A) = \sum_{D \in B(A)} (-1)^{|D \setminus A|} \, \hat{P}(a, D, \omega). \]

Observe that

\[ f(a, A) = \sum_{D \in B(A)} (-1)^{|D \setminus A|} \sum_{i=1}^{I} \omega_i \, \hat{p}_i(a, D) \]

by definition. We can interchange the first summation operator with the second because the first does not depend on i; then

\[ f(a, A) = \sum_{i=1}^{I} \omega_i \sum_{D \in B(A)} (-1)^{|D \setminus A|} \, \hat{p}_i(a, D), \]

which implies

\[ f(a, A) = \sum_{i=1}^{I} \omega_i \, f_i(a, A). \]

By the assumption that each of the mappings p̂_i is totally monotonic we have that f_i(a, A) ≥ 0 for all i ∈ {1, ..., I}, thus establishing the result.

2.A.15 Proof of Lemma 2.40

Before the proof of Lemma 2.40 we need the following preliminaries.

Remark 2.54 It will be useful to notice that the FDSM-CRT can be written in the following form:

\[ p(a, A) = \sum_{r_X \in R_X} p^{MM}_{r_A = r_X|A, \tau}(a, A) \, \gamma(r_X) + p^{MM}_{\tau}(o, A) \, \mathbf{1}\left(u(a) > u(b) \ \ \forall b \in A \setminus \{a\}\right), \]

where p^{MM}_{r_A=r_X|A,τ}(a, A) = τ(a) Π_{b∈A: b r_A a; r_A=r_X|A} (1 − τ(b)) is numerically equivalent to the MM probability of a ∈ A being chosen and r_X|A is the restriction of the ordering r_X to the set A. And p^{MM}_τ(o, A) = Π_{c∈A} (1 − τ(c)) is numerically equivalent to the MM probability of the default alternative. Here there is no default alternative and we mean by p^{MM}(o, A) the probability of no element in A being satisficing (keeping the MM notation for intuition).

Definition 2.55 (Successive differences for mappings) The successive differences for a mapping p̂ : X × D ↦ [0, 1], for a fixed a ∈ A and A ∈ D and the probability p̂(a, A), are defined recursively as

\[ \Delta_{A_1} \hat{p}(a, A) = \hat{p}(a, A) - \hat{p}(a, A \cup A_1) \]

for A, A_1 ∈ D, and

\[ \Delta_{A_n} \cdots \Delta_{A_1} \hat{p}(a, A) = \Delta_{A_{n-1}} \cdots \Delta_{A_1} \hat{p}(a, A) - \Delta_{A_{n-1}} \cdots \Delta_{A_1} \hat{p}(a, A \cup A_n) \]

for all n ≥ 2 and for all A, A_1, ..., A_n ∈ D.
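The recursion in Definition 2.55 can be made concrete with a short computation. The following is a minimal sketch under stated assumptions: the mapping used is a hypothetical Luce rule with made-up weights, chosen only because it is a standard random utility model, and the function name `successive_difference` is illustrative rather than part of the text.

```python
from fractions import Fraction

# Hypothetical Luce weights; any positive numbers would do.
weights = {"a": 1, "b": 2, "c": 3, "d": 4}


def luce(a, menu):
    """Luce choice probability of a in menu (an assumed example mapping)."""
    return Fraction(weights[a], sum(weights[x] for x in menu))


def successive_difference(p_hat, a, menu, sets):
    """Delta_{A_n} ... Delta_{A_1} p_hat(a, menu), with sets = [A_1, ..., A_n].

    Follows the recursive definition: the n-th difference is the (n-1)-th
    difference at `menu` minus the (n-1)-th difference at `menu | A_n`.
    """
    if not sets:
        return p_hat(a, menu)
    *earlier, last = sets
    return (successive_difference(p_hat, a, menu, earlier)
            - successive_difference(p_hat, a, menu | last, earlier))


A = frozenset("ab")
A1 = frozenset("c")
A2 = frozenset("d")

first = successive_difference(luce, "a", A, [A1])        # p(a,A) - p(a,A ∪ A1)
second = successive_difference(luce, "a", A, [A1, A2])   # Delta_{A2} Delta_{A1} p(a,A)
```

Both differences are non-negative in this example, which is what the weakly increasing successive differences property requires of a totally monotone mapping.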
Definition 2.56 (Weakly increasing successive differences) For a mapping p̂ : X × D ↦ [0, 1], a fixed a ∈ X, all A ∈ D such that a ∈ A, and any {A_i}_{i=1}^n ∈ D^n, the mapping p̂ satisfies the weakly increasing successive differences property when the successive differences are non-negative: Δ_{A_n} ··· Δ_{A_1} p̂(a, A) ≥ 0 for all n ≥ 1.

Proof. Notice that any mapping p̂ : X × D ↦ [0, 1], for a fixed (a, A) such that a ∈ A for all A ∈ D, is totally monotone if and only if it satisfies weakly increasing successive differences (Molchanov 2005).

Define P̂(a, A) = T(A) 1(u(a) > u(b) ∀b ∈ A \ {a}), where T(A) = Π_{c∈A} (1 − τ(c)) is totally monotone by MM (since T(A) = p^{MM}_τ(o, A) is numerically equivalent to the probability of a default choice in MM, as explained in the remark) and thus satisfies, for all A ∈ D such that a ∈ A and any {A_i}_{i=1}^n ∈ D^n, the weakly increasing successive differences Δ_{A_n} ··· Δ_{A_1} (1 − P(A)) ≥ 0 for all n ≥ 1. The same is true for C_a(A) = 1(u(a) > u(b) ∀b ∈ A \ {a}), so that Δ_{A_n} ··· Δ_{A_1} 1(u(a) > u(b) ∀b ∈ A \ {a}) ≥ 0. Notice that

\[ \Delta_{A_1} \hat{P}(a, A) = \hat{P}(a, A) - \hat{P}(a, A \cup A_1) = T(A) C_a(A) - T(A \cup A_1) C_a(A \cup A_1) = \begin{cases} 0 & C_a(A) = 0 \\ \Delta_{A_1} T(A) & C_a(A \cup A_1) = 1 \\ T(A) & C_a(A) = 1,\ C_a(A \cup A_1) = 0 \end{cases} \]

Also, for the next difference, Δ_{A_2} Δ_{A_1} P̂(a, A) = T(A) C_a(A) − T(A ∪ A_1) C_a(A ∪ A_1) − [T(A ∪ A_2) C_a(A ∪ A_2) − T(A ∪ A_1 ∪ A_2) C_a(A ∪ A_1 ∪ A_2)]. Now

\[ \Delta_{A_2} \Delta_{A_1} \hat{P}(a, A) = \Delta_{A_1} \hat{P}(a, A) - \Delta_{A_1} \hat{P}(a, A \cup A_2) = \begin{cases} 0 & C_a(A) = 0 \\ \Delta_{A_1} T(A) & C_a(A \cup A_1) = 1,\ C_a(A \cup A_2) = 0 \\ T(A) & C_a(A) = 1,\ C_a(A \cup A_1) = 0,\ C_a(A \cup A_2) = 0 \\ \Delta_{A_2} T(A) & C_a(A) = 1,\ C_a(A \cup A_2) = 1,\ C_a(A \cup A_1) = 0 \\ \Delta_{A_2} \Delta_{A_1} T(A) & C_a(A \cup A_1 \cup A_2) = 1 \end{cases} \]

Then we notice that in general the following properties hold:

(i) Null operator: Δ_∅ Δ_{A_{n−1}} ··· Δ_{A_1} T(A) = 0,

(ii) Absorption: Δ_{A_{n−1}} Δ_{A_{n−1}} ··· Δ_{A_1} T(A) = Δ_{A_{n−1}} ··· Δ_{A_1} T(A).
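The two properties above can be checked numerically for T(A) = Π_{c∈A} (1 − τ(c)). The following is a minimal sketch, assuming hypothetical values of τ on a four-element set; the helper `diff` simply applies the recursive definition of the successive differences to a set function and is not notation from the text.

```python
from fractions import Fraction
from math import prod

# Hypothetical satisficing probabilities tau(c); the values are assumptions.
tau = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 4), "d": Fraction(0)}


def T(menu):
    """T(A): probability that no element of the menu is satisficing."""
    return prod((1 - tau[c] for c in menu), start=Fraction(1))


def diff(f, menu, sets):
    """Delta_{A_n} ... Delta_{A_1} f(menu) for sets = [A_1, ..., A_n]."""
    if not sets:
        return f(menu)
    *earlier, last = sets
    return diff(f, menu, earlier) - diff(f, menu | last, earlier)


A = frozenset("a")
A1 = frozenset("b")
A2 = frozenset("c")
empty = frozenset()

single = diff(T, A, [A1])              # T(A) - T(A ∪ A1)
double = diff(T, A, [A1, A2])          # Delta_{A2} Delta_{A1} T(A)
null_case = diff(T, A, [A1, empty])    # (i): applying Delta_∅ last gives 0
absorbed = diff(T, A, [A1, A1])        # (ii): repeating Delta_{A1} changes nothing
```

Here the null operator returns zero because A ∪ ∅ = A, and absorption holds because the second application of Δ_{A_1} subtracts a difference evaluated at A ∪ A_1, which is itself zero.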
Δ_{A_1} P̂(a, A) ∈ {Δ_{X_1} T(A)} for X_1 ∈ {∅, A_1}, and its value depends on the values of C_a(A) and C_a(A ∪ A_1), but in any case we have Δ_{X_1} T(A) ≥ 0 for all X_1 ∈ {∅, A_1}. Notice that Δ_∅ T(A) = 0.

Also, Δ_{A_2} Δ_{A_1} P̂(a, A) ∈ {Δ_{X_{2,2}} Δ_{X_{2,1}} T(A)}, where X_{2,1} ∈ {∅, A_1, A_2} and X_{2,2} ∈ {∅, A_1, A_2}, and where the actual combination depends on the values of the vector {C_a(A ∪ X_{2,1} ∪ X_{2,2})} over X_{2,1}, X_{2,2} ∈ {∅, A_1, A_2}; again, in any case, Δ_{X_{2,2}} Δ_{X_{2,1}} T(A) ≥ 0 for X_{2,1} ∈ {∅, A_1, A_2} and X_{2,2} ∈ {∅, A_1, A_2}. Notice that

\[ \Delta_{\emptyset} \Delta_{X_{2,1}} T(A) = \Delta_{X_{2,1}} T(A) - \Delta_{X_{2,1}} T(A) = 0, \]

\[ \Delta_{A_2} \Delta_{A_2} T(A) = \Delta_{A_2} T(A) - \Delta_{A_2} T(A \cup A_2) = \Delta_{A_2} T(A). \]

The induction hypothesis is: Δ_{A_{n−1}} ··· Δ_{A_1} P̂(a, A) ∈ {Δ_{X_{n−1,n−1}} ··· Δ_{X_{n−1,1}} T(A)} for X_{n−1,i} ∈ {∅, A_1, A_2, ..., A_{n−1}} for all i ∈ {1, ..., n − 1}, with Δ_{X_{n−1,n−1}} ··· Δ_{X_{n−1,1}} T(A) ≥ 0 for any X_{n−1,i} ∈ {∅, A_1, A_2, ..., A_{n−1}} for all i ∈ {1, ..., n − 1}.

We have to prove that Δ_{A_n} ··· Δ_{A_1} P̂(a, A) ≥ 0. Notice that

\[ \Delta_{A_n} \cdots \Delta_{A_1} \hat{P}(a, A) = \Delta_{A_{n-1}} \cdots \Delta_{A_1} \hat{P}(a, A) - \Delta_{A_{n-1}} \cdots \Delta_{A_1} \hat{P}(a, A \cup A_n) \]

by definition. Now, by the induction step,

\[ \Delta_{A_n} \cdots \Delta_{A_1} \hat{P}(a, A) \in \left\{ \Delta_{X_{n-1,n-1}} \cdots \Delta_{X_{n-1,1}} T(A) - \Delta_{X_{n-1,n-1}} \cdots \Delta_{X_{n-1,1}} T(A \cup A_n) \right\} \]

for X_{n−1,i} ∈ {∅, A_1, A_2, ..., A_{n−1}} for all i ∈ {1, ..., n − 1}. Finally, by the definition of the difference operator:

\[ \Delta_{X_{n-1,n-1}} \cdots \Delta_{X_{n-1,1}} T(A) - \Delta_{X_{n-1,n-1}} \cdots \Delta_{X_{n-1,1}} T(A \cup A_n) = \Delta_{A_n} \Delta_{X_{n-1,n-1}} \cdots \Delta_{X_{n-1,1}} T(A). \]

We notice that Δ_{A_n} Δ_{X_{n−1,n−1}} ··· Δ_{X_{n−1,1}} T(A) ≥ 0 by total monotonicity of T, which works for any combination of X_{n−1,i} ∈ {∅, A_1, A_2, ..., A_{n−1}} for all i ∈ {1, ..., n − 1}. Thus Δ_{A_n} ··· Δ_{A_1} P̂(a, A) ≥ 0 for any fixed a ∈ A and all such A ∈ D and for all {A_i}_{i=1}^n ∈ D^n for all n ≥ 1. We conclude that P̂ is a totally monotone mapping.

2.A.16 Proof of Lemma 2.36

Proof. (I) If the data is generated by FSSM-CRT then it satisfies SARP.
First notice that if τ(a) > 0 then it follows under full support that p(a, A) > 0. Furthermore, by full support and common threshold we observe also that p(a, A) > 0 when there is a b ∈ A \ {a} such that τ(b) > 0 only if τ(a) > 0. Notice that if there is a b ∈ A \ {a} with τ(b) > 0 and τ(a) = 0, then by the common threshold and monotonicity of a CDF F_{u*} we know that it cannot be the case that u(a) > u(b), thus p(a, A) = 0 (because τ(a) = 0 implies the term 1(u(a) > u(b) ∀b ∈ A \ {a}) = 0 under the CRT assumption).

To show that Axiom 2.10 holds, assume, by way of contradiction, that (i) a is stochastically revealed preferred to b and (ii) b is stochastically strictly revealed preferred to a. From (ii), given that the data admits a FSSM-CRT, we must have (by the full support assumption) that τ(a) = Pr(u(a) > u*) = 0 and u(a) < u(b). If a is stochastically revealed preferred to b then there must exist a sequence of alternatives c_1, ..., c_N and choice sets A_1, ..., A_{N−1} such that c_1 = a, c_N = b, c_n, c_{n+1} ∈ A_n and c_n ∈ C(A_n). If τ(b) = Pr(u(b) > u*) > 0 then it must be the case that τ(c_{N−1}) = Pr(u(c_{N−1}) > u*) > 0 (otherwise c_{N−1} could not be chosen with positive probability when c_N was available). Iterating on this argument implies that τ(a) = Pr(u(a) > u*) > 0. If τ(b) = Pr(u(b) > u*) = 0 then this implies that u(c_{N−1}) > u(b). If Pr(u(c_n) > u*) = 0 for all n then iterating on this argument implies that u(a) > u(b). Otherwise, the previous argument implies that τ(a) = Pr(u(a) > u*) > 0. Either provides a contradiction.

(II) A dataset that has a FSSM-CRT representation satisfies the Deterministic no satisficing choice axiom. Observe that W* = {a ∈ X | τ(a) = Pr(u(a) > u*) > 0} under full support; in that sense, when a ∈ X \ W* and we have a menu A such that a ∈ A: (i) Either there is a b ∈ W* ∩ A, in which case u(b) > u(a) because τ(b) > τ(a) = 0 only if u(b) > u(a).
This is because Pr(u(b) > u*) > 0 = Pr(u(a) > u*) only if u(b) > u(a); thus p(a, A) = 0 (under the common random threshold). (ii) Or there is no b ∈ W* ∩ A, in which case 1 − P(A) = 1, thus making the probability p(a, A) = 1(u(a) > u(c) ∀c ∈ A), which is by definition either p(a, A) = 0 or p(a, A) = 1.

2.A.17 Proof of Theorem 2.37

Proof. First we prove that (1) implies (2). If a dataset is generated by a FSSM with parameters {{γ_A}_{A∈D}, u, u*} then it can be generated by a FSSM-CRT {{γ_A}_{A∈D}, u^{RT}, τ} with τ(a) = Pr(u(a) > u*), where u* is a constant random variable, such that τ(a) = 1 for all elements in FSSM such that u(a) > u* and τ(b) = 0 for u(b) < u*, with the same utility u^{RT} ≡ u and the same random search function γ^{RT}_A ≡ γ_A. In other words, FSSM is a special case of FSSM-CRT with a constant threshold.

Now we prove that (2) implies (1). If a dataset is generated by FSSM-CRT, by Lemma 2.36 it satisfies Deterministic no satisficing choice (Axiom 2.8) and SARP (Axiom 2.10). Thus by Theorem 2.11 we can build an FSSM that generates the data. Thus (2) implies (1).

2.A.18 Proof of Lemma 2.41

Proof. (I) If the data is generated by FDSM-CRT then p(a, A) satisfies total monotonicity for all a ∈ A and all A ⊆ X. We define the MM model for a fixed linear search order r_A as p^{MM}_{r_A,τ}(a, A) = τ(a) Π_{b∈B_{r_A}(a)} (1 − τ(b)) and p^{MM}_τ(o, A) = Π_{c∈A} (1 − τ(c)). Now notice that the FDSM-CRT can be written as the average of the MM models, plus a correction for non-satisficing cases:

\[ p(a, A) = \sum_{r_X \in R_X} p^{MM}_{r_A = r_X|A, \tau}(a, A) \, \gamma(r_X) + p^{MM}_{\tau}(o, A) \, \mathbf{1}\left(u(a) > u(b) \ \ \forall b \in A \setminus \{a\}\right). \]

It is evident by MM that p^{MM}_{r_A,τ}(a, A) satisfies total monotonicity for all r_A ∈ R_A; therefore by Lemma 2.39 we know that a summation (here a convex combination) of totally monotonic mappings also satisfies total monotonicity, thus the term Σ_{r_X∈R_X} p^{MM}_{r_A=r_X|A,τ}(a, A) γ(r_X) satisfies total monotonicity.
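As a numerical illustration of this decomposition, the following minimal sketch computes FDSM-CRT choice probabilities on a hypothetical three-element set. The utilities, the values of τ (taken to be increasing in u, as the common random threshold requires), the uniform γ over search orders, and all function names are assumptions made for illustration; "b r_A a" is read here as "b is searched strictly before a".

```python
from fractions import Fraction
from itertools import permutations
from math import prod

X = ("a", "b", "c")
u = {"a": 3, "b": 2, "c": 1}                                    # assumed injective utility
tau = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(0)}  # assumed Pr(u(x) > u*)
orders = list(permutations(X))                                  # linear search orders on X
gamma = {r: Fraction(1, len(orders)) for r in orders}           # assumed uniform distribution


def mm_prob(a, menu, order):
    """tau(a) times the probability that everything searched before a (within
    the menu, following the restriction of `order`) is not satisficing."""
    before = [b for b in order[:order.index(a)] if b in menu]
    return tau[a] * prod((1 - tau[b] for b in before), start=Fraction(1))


def no_satisficing_prob(menu):
    """Probability that no element of the menu is satisficing."""
    return prod((1 - tau[c] for c in menu), start=Fraction(1))


def choice_prob(a, menu):
    """Average of the fixed-order probabilities plus the correction term that
    assigns the 'nothing satisficing' mass to the best element of the menu."""
    searched = sum(gamma[r] * mm_prob(a, menu, r) for r in orders)
    is_best = all(u[a] > u[b] for b in menu if b != a)
    return searched + (no_satisficing_prob(menu) if is_best else 0)


menus = [frozenset(m) for m in ("ab", "ac", "bc", "abc", "c")]
totals = {m: sum(choice_prob(a, m) for a in m) for m in menus}
```

In each menu the probabilities add up to one: the fixed-order terms sum to one minus the probability that nothing is satisficing, and the correction term hands the remaining mass to the utility-maximizing element.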
The same argument is used for p^{MM}_τ(o, A), which by MM is also totally monotonic.

Also, notice that the indicator function 1(u(a) > u(b) ∀b ∈ A \ {a}) satisfies total monotonicity because it can be understood as a degenerate random utility distribution. We know also that p^{MM}_τ(o, A) 1(u(a) > u(b) ∀b ∈ A \ {a}) is totally monotonic by Lemma 2.40 (see footnote 13).

Footnote 12. Notice that p̂(a, A) = Σ_{r_X∈R_X} p^{MM}_{r_A=r_X|A,τ}(a, A) γ(r_X) satisfies total monotonicity; however, this term is a quasi-probability because it does not necessarily add up to 1, so it is not a RUM in general. RUM is equivalent to total monotonicity, being non-negative and adding up to 1.

Footnote 13. A non-constructive but equally valid argument is that 1 − p^{MM}_τ(o, A) 1(u(a) > u(b) ∀b ∈ A \ {a}) can be understood as the capacity of a random set that is the union of two independent random sets with capacities 1 − p^{MM}_τ(o, A) and (1 − 1(u(a) > u(b) ∀b ∈ A \ {a})), corresponding to the random sets Ŵ = {x ∈ X : u(x) > u*}, where u* is the random threshold, and B(a) = {b ∈ X : u(b) > u(a)}, the deterministic set of items better than a under the utility u. By Molchanov (2005), p^{MM}_τ(o, A) 1(u(a) > u(b) ∀b ∈ A \ {a}) is totally monotone, and p^{MM}_τ(o, A) 1(u(a) > u(b) ∀b ∈ A \ {a}) = Pr((Ŵ ∪ B(a)) ∩ A = ∅), the probability that A does not have anything satisficing or better than a.

Since p(a, A) is the summation of two totally monotonic mappings, again by Lemma 2.39 we conclude that p(a, A) satisfies total monotonicity.

(II) If the data is generated by FDSM-CRT then it satisfies SARP. This follows from the fact that FDSM-CRT is a special case of FSSM-CRT and by Lemma 2.36 we know that it satisfies SARP.

(III) A dataset that has a FDSM-CRT representation satisfies the Deterministic no satisficing choice axiom. This follows from the fact that FDSM-CRT is a special case of FSSM-CRT and by Lemma 2.36 we know that it satisfies the Deterministic no satisficing choice axiom.

2.A.19 Proof of Theorem 2.42

Proof.
First we prove that (1) implies (2).

If a dataset is generated by a FDSM with parameters $\{\{\gamma_A\}_{A \in D}, u, u^*\}$, then it can be generated by a FDSM-CRT $\{\{\gamma_A\}_{A \in D}, u^{RT}, \tau\}$ with $\tau(a) = \Pr(u(a) > u^*)$, where $u^*$ is a constant random variable, such that $\tau(a) = 1$ for all elements in the FDSM with $u(a) > u^*$ and $\tau(b) = 0$ for $u(b) < u^*$, with the same utility $u^{RT} \equiv u$ and the same random search function $\gamma^{RT}_A \equiv \gamma_A$. In other words, FDSM is a special case of FDSM-CRT with a constant threshold.

Now we prove that (2) implies (1).

If a dataset is generated by FDSM-CRT, then by Lemma 2.41 it satisfies Deterministic no satisficing choice (Axiom 2.8), SARP (Axiom 2.10) and Total Monotonicity (Axiom 2.16). Thus by Theorem 2.17 we can build an FDSM that generates the data. Thus (2) implies (1).

2.A.20 Proof of Theorem 2.48

Proof. First we prove that (1) implies (2).

First we recall that for a fixed $u \in U$, $p^{FDSM}_u(a, A)$ satisfies Total Monotonicity (Axiom 2.16). By Lemma 2.39, the weighted average of totally monotonic mappings, $p(a, A) = \sum_{u \in U} \rho(u)\, p^{FDSM}_u(a, A)$, is also totally monotone; thus RU-FDSM is totally monotone.

Second, we notice that if $x \in X \setminus W^*$ then it is never satisficing, that is, $u(x) < u^*$ for all $u \in U$ with $\rho(u) > 0$. This means that for any fixed $u \in U$, $p^{FDSM}_u(x, A) = 0$ if $A \cap W^* \neq \emptyset$, which in turn implies that $p(x, A) = \sum_{u \in U} \rho(u)\, p^{FDSM}_u(x, A) = 0$. Thus degeneracy is established for RU-FDSM.

Now we prove that (2) implies (3).

With Claim 2.45 in hand, we define the following virtual dataset. We define the equivalence relation $\sim\ = \{(a, b) \in X \times X : a \in X \setminus W^*, b \in X \setminus W^*\} \cup \{(c, c) \in X \times X : c \in W^*\}$ (it is symmetric, reflexive and transitive). We define the set $X^\sim = X / \sim$ as the quotient space with respect to the equivalence relation. In words, we are "shrinking" all never satisficing elements to a singleton. We concentrate on the quotient set $X^\sim = X / \sim$, and we define the canonical projection $j : X \to X / \sim$ and its inverse mapping $j^{-1} : X / \sim\ \to X$.
We let $D^\sim \equiv \{j(A)\}_{A \in D}$ be the set indexed by $D$. In particular, define $p^\sim : X^\sim \times D^\sim \to [0, 1]$ as $p^\sim(a^\sim, A^\sim) = \sum_{a \in j^{-1}(a^\sim) \cap A} p(a, A)$ for $A \in D$ such that $j(A) = A^\sim$ and $a^\sim \in A^\sim$; this mapping is well defined. If Degeneracy (Axiom 2.44) holds, it follows that the quotient dataset $\{p^\sim(a^\sim, A^\sim)\}_{a^\sim \in X^\sim, A^\sim \in D^\sim}$ satisfies Deterministic no satisficing choice (Axiom 2.8) in $X^\sim$. Indeed, defining $W^{*\sim} = \{a^\sim \in X^\sim \mid p^\sim(a^\sim, A^\sim) > 0\ \forall A^\sim \in D^\sim\}$, we notice that $X^\sim \setminus W^{*\sim}$ corresponds by construction to a singleton $\{z^\sim\} \equiv X^\sim \setminus W^{*\sim}$ when there is a non-satisficing element (or is empty if not), with probability $p^\sim(z^\sim, A^\sim) = \sum_{a \in j^{-1}(z^\sim) \cap A} p(a, A) \in \{0, 1\}$ in all $A^\sim \in D^\sim$. In particular, it is exactly 1 only when $A^\sim \equiv \{z^\sim\}$ and zero otherwise.

Also notice that Degeneracy implies SARP in $\{p^\sim(a^\sim, A^\sim)\}_{a^\sim \in X^\sim, A^\sim \in D^\sim}$, because all $x^\sim, y^\sim \in W^{*\sim}$ are declared stochastically revealed preferred to one another (i.e., stochastically revealed indifferent) and $x^\sim \in W^{*\sim}$ is always strictly revealed preferred to $z^\sim$ (the non-satisficing singleton $\{z^\sim\} \equiv X^\sim \setminus W^{*\sim}$). Thus $C^\sim(A^\sim) = \{a^\sim \in A^\sim : p^\sim(a^\sim, A^\sim) > 0\}$ satisfies SARP. It is also simple to see that the dataset in the quotient space $X^\sim$ satisfies total monotonicity, because $p^\sim(a^\sim, A^\sim) = \sum_{a \in j^{-1}(a^\sim) \cap A} p(a, A)$, for $A \in D$ such that $j(A) = A^\sim$ and $a^\sim \in A^\sim$, is a sum of mappings $p(a, A)$ that are totally monotonic by (2), Total Monotonicity (Axiom 2.16).

By the theorem that characterizes the FDSM (Theorem 2.17), the dataset $\{p^\sim(a^\sim, A^\sim)\}_{a^\sim \in X^\sim, A^\sim \in D^\sim}$ in the quotient space has a FDSM representation; thus we build a triple $(u^\sim, u^*, \{\gamma^\sim_{A^\sim}\}_{A^\sim \in D^\sim})$ that generates the quotient dataset.

With this in hand we build the RU-FDSM in the actual dataset $\{p(a, A)\}_{a \in X, A \in D}$.

(i) We build a utility function $u : X \to \mathbb{R}$ by the composition $u = u^\sim \circ j$, where $u^\sim$ is the utility of the FDSM representation of the quotient dataset and $j$ is the canonical projection defined above. By construction $u(a) > u^*$ for all $a \in W^*$ and $u(b) < u^*$ for all $b \in X \setminus W^*$.
We now build $(\rho, U)$ by imposing the following restriction on all of its elements: every $\hat{u} \in U$ with $\rho(\hat{u}) > 0$ is such that $\hat{u}(a) = u(a)$ for all $a \in W^*$. For $b \in X \setminus W^*$ we use total monotonicity (Axiom 2.16), which holds for the restricted dataset $\{p(a, A)\}_{a \in X \setminus W^*, A \in D, A \subseteq X \setminus W^*}$, to obtain a random utility by (Falmagne 1978) defined on $X \setminus W^*$. We notice that under this random utility $p(a, A) = \rho(\hat{u} : \hat{u}(a) > \hat{u}(b)\ \forall b \in A \setminus \{a\})$ for $a \in A \subseteq X \setminus W^*$. We fix an injective utility $\hat{u}_{X \setminus W^*} : X \setminus W^* \to \mathbb{R}$ with $\rho(\hat{u}_{X \setminus W^*})$ compatible with $\rho(\hat{u}_{X \setminus W^*} : \hat{u}_{X \setminus W^*}(a) > \hat{u}_{X \setminus W^*}(b)\ \forall b \in A \setminus \{a\})$ with $A \subseteq X \setminus W^*$ (we omit the trivial construction). We extend $\hat{u}_{X \setminus W^*} : X \setminus W^* \to \mathbb{R}$ to the grand set $X$ by defining $\hat{u} : X \to \mathbb{R}$ using $\hat{u}(a) = u(a)$ for all $a \in W^*$ and $\hat{u}(b) = \hat{u}_{X \setminus W^*}(b)$ for all $b \in X \setminus W^*$; this is an injective utility on $X$ with mass $\rho(\hat{u}) = \rho(\hat{u}_{X \setminus W^*})$. We have built $(\rho, U)$ that has the Always satisficing random utility property.

(ii) The search probabilities are defined over the quotient set; we build search probabilities for the actual set $X$. The quotient search distribution, written $\gamma^\sim$ here, defines a full support search distribution on each $A^\sim \in D^\sim$. We define $\gamma$ by the following algorithm. For each menu $A \in D$, we obtain the menu $A^\sim \equiv j(A) \in D^\sim$ in the quotient dataset, and then take $R_{A^\sim}$, the support of $\gamma^\sim_{A^\sim}$. For any element $r_{A^\sim} \in R_{A^\sim}$, define the restriction $r_{A^\sim}|_{W^{*\sim}}$. Now build the set of linear search orders on $A$:

$$R_A(r_{A^\sim}|_{W^{*\sim}}) = \{r_A \in R_A : a, b \in W^*,\ a\, r_A\, b \text{ if } j(a)\, r_{A^\sim}|_{W^{*\sim}}\, j(b)\}.$$

We assign to each element $r_A \in R_A(r_{A^\sim}|_{W^{*\sim}})$ the probability

$$\gamma_A(r_A) = \frac{\sum_{r'_{A^\sim} \in R_{A^\sim}} \gamma^\sim_{A^\sim}(r'_{A^\sim})\, 1(r'_{A^\sim}|_{W^{*\sim}} = r_{A^\sim}|_{W^{*\sim}})}{|R_A(r_{A^\sim}|_{W^{*\sim}})|}.$$

This construction provides us with $\{\gamma_A\}_{A \in D}$, which defines a Full Support random linear ordering on each menu in $D$. By Total Monotonicity (Axiom 2.16), $\gamma^\sim$ has the Fixed Distribution property; since we build $\gamma$ on the basis of it, extending it to the original dataset with a uniform rule, it follows that $\{\gamma_A\}_{A \in D}$ has the Fixed Distribution property.
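The lifting step in (ii) can be sketched in Python as follows. This is only an illustrative reading of the algorithm: it assumes that satisficing alternatives keep their own labels in the quotient menu, that all never satisficing alternatives are collapsed into a single label (here the hypothetical `"z"`), and that the quotient search distribution is given as a dictionary from search orders to probabilities. The function name and the example values are not part of the original construction.

```python
from collections import defaultdict
from itertools import permutations


def lift_search_distribution(menu, satisficing, gamma_quotient):
    """Lift a search distribution on the quotient menu to the actual menu.

    Each quotient order is restricted to the satisficing alternatives; the mass
    of every restricted order is then split uniformly over all linear orders on
    the actual menu that agree with it on the satisficing alternatives.
    """
    # Total quotient mass attached to each restricted order on W*.
    mass = defaultdict(float)
    for order, weight in gamma_quotient.items():
        restricted = tuple(x for x in order if x in satisficing)
        mass[restricted] += weight

    # Group the linear orders on the actual menu by their restriction to W*.
    compatible = defaultdict(list)
    for order in permutations(menu):
        restricted = tuple(x for x in order if x in satisficing)
        compatible[restricted].append(order)

    # Uniform rule: equal share of the mass for every compatible order.
    gamma = {}
    for restricted, total in mass.items():
        for order in compatible[restricted]:
            gamma[order] = total / len(compatible[restricted])
    return gamma


# Hypothetical example: W* = {a, b}; c and d are collapsed into "z" in the quotient.
menu = ["a", "b", "c", "d"]
satisficing = {"a", "b"}
gamma_quotient = {
    ("a", "b", "z"): 0.3, ("a", "z", "b"): 0.2, ("z", "a", "b"): 0.2,
    ("b", "a", "z"): 0.1, ("b", "z", "a"): 0.1, ("z", "b", "a"): 0.1,
}
gamma_A = lift_search_distribution(menu, satisficing, gamma_quotient)
```

In the example, the orders that search a before b share a total mass of 0.7 and those that search b before a share 0.3, each split evenly over the twelve compatible orders on the four-element menu, so the lifted distribution has full support.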
It is clear that for a fixed $u \in U$ constructed above with $\rho(u) > 0$, we have an FDSM with $(u, u^*, \{\gamma_A\}_{A \in D})$, where $u$ and $\{\gamma_A\}_{A \in D}$ are constructed above and $u^*$ is the same as in the FDSM in the quotient space. Thus we have built a RU-FDSM

$$p(a, A) = \sum_{u \in U} \rho(u)\, p^{FDSM}_u(a, A)$$

with FDSM $(u, u^*, \{\gamma_A\}_{A \in D})$ and with the always satisficing property for $(\rho, U)$.

Finally, this generates the dataset $(p, D)$. For $a \in W^*$, $p(a, A) = p^{FDSM}_u(a, A)$ for any fixed $u \in U$ with $\rho(u) > 0$ (thanks to the always satisficing property), so the fact that the constructed RU-FDSM (with the always satisficing property) generates these probabilities follows from Theorem 2.17. For $a \in X \setminus W^*$, we have $p(a, A) = 1(\forall c \in A : u(c) \leq u^*)\, \rho(u : u(a) > u(b)\ \forall b \in A \setminus \{a\})$ for any fixed $u \in U$ with $\rho(u) > 0$ (again due to the always satisficing property). There are two cases. If $1(\forall c \in A : u(c) \leq u^*) = 1$, the fact that the constructed RU-FDSM generates the data follows from (Falmagne 1978), as these cases are equivalent to random utility in the restricted dataset defined on $X \setminus W^*$. The remaining case is $1(\forall c \in A : u(c) \leq u^*) = 0$, in which case we know that there exists $c \in A$ such that $c \in W^*$, and by Degeneracy (Axiom 2.44) we conclude that $p(a, A) = 0$, as it should be. Thus the constructed RU-FDSM generates the data.

Finally we prove that (3) implies (1).

We notice that if a complete stochastic choice dataset $(D, p)$ can be generated by a RU-FDSM with an Always satisficing random utility distribution, it is by definition a special case of RU-FDSM, with the additional Always satisficing restriction on $(\rho, U)$; thus the dataset $(D, p)$ is also generated by a RU-FDSM.

Conclusion

Chapter 1 proposes a predictive ability approach for assessing the performance of a model that is described by a set of axioms. Following this approach, I judge a model by how useful it is in providing precise predictions.
Intuitively, higher precision results when a lot of information about preferences can be inferred from the data while only small errors are required to satisfy the model. Therefore, the predictive ability approach establishes a natural trade-off between fit and power conditional on observed behavior, providing a meaningful answer to the fit and power problem, a long-standing problem in the literature. Moreover, by considering conditional power, the proposed approach also reflects the effect of the actual pattern of choices observed, and I show that traditional power masks significant differences in the amount of information that can be learned from data to predict behavior.

Additionally, the construction provided for the predictive distribution can be used to forecast consumers' responses to policy or market interventions. I study the performance of the proposed measures with the experimental data from Choi et al. (2007). As expected, my measures are positively correlated with the amount of preference information that can be learned from data, and with other measures of fit in the literature only through the error process. I also show that more observations enhance predictive precision (the additional revealed information outweighs the potential increase in errors). Finally, I find that the non-parametric utility maximization model provides more precise predictions than a model that assumes Cobb-Douglas preferences: the additional precision of the latter model is outweighed by the larger error process needed to fit the data.

The proposed predictive precision measures are shown to be related to the extent of the identification of the systematic component of the process that generated the data. Therefore these measures can be "inverted" to optimally design an experiment to test a given set of candidate models.
An avenue for future research is to produce an efficient way to conduct experimental design based on these measures, which can be of great interest for the experimental field. Concretely, this project requires: (i) providing a general, computationally feasible algorithm for the construction of the predictive distribution; (ii) streamlining the connection between power and predictive precision for general contexts and providing a feasible algorithm for experimental selection; and (iii) studying the finite sample and asymptotic properties of these measures for general models of behavior.

The proposed approach to assessing the performance of a model is a broad one that can be applied to a myriad of behavioral models. However, successfully applying this approach to other models, which may deliver empirical implications in more general contexts, depends on the extension of the predictive distribution to these contexts. In general, the predictive distribution will reflect the two types of uncertainty described above. However, the nature of the error process may differ across economic environments. For example, the error process should exhibit a different structure when dealing with preferences over lotteries, menus, sets of discrete alternatives or consumption bundles. One could think of defining the error process for discrete choices as the minimal number of swaps required to make the revealed preference information inferred from data consistent (in the case of utility maximization), as in (Apesteguia and Ballester 2013).

Chapter 2 develops necessary and sufficient conditions for stochastic choice data to be consistent with satisficing, assuming that preferences are fixed but search order may change randomly. The model predicts that stochastic choice can only occur amongst elements that are always chosen, while all other choices must be consistent with standard utility maximization.
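The prediction summarized above can be illustrated with a small Python sketch of satisficing with a fixed utility, a fixed threshold and a random search order. The function name, the alternatives and their utilities are hypothetical, and the uniform distribution over search orders is an assumption made only for this example (the model allows other, menu-dependent distributions).

```python
from itertools import permutations


def fssm_choice_probs(menu, u, u_star, gamma=None):
    """Choice probabilities under satisficing with fixed utility and threshold.

    gamma maps search orders (tuples over the menu) to probabilities; if it is
    None, a uniform distribution over all orders is assumed for illustration.
    """
    if gamma is None:
        orders = list(permutations(menu))
        gamma = {r: 1.0 / len(orders) for r in orders}
    probs = {a: 0.0 for a in menu}
    for order, weight in gamma.items():
        # Search in order and stop at the first alternative above the threshold.
        chosen = next((x for x in order if u[x] > u_star), None)
        if chosen is None:
            # Nothing satisfices: the utility-maximizing alternative is chosen.
            chosen = max(menu, key=lambda x: u[x])
        probs[chosen] += weight
    return probs


# Hypothetical example: a and b are satisficing, c and d are not.
u = {"a": 5.0, "b": 4.0, "c": 2.0, "d": 1.0}
u_star = 3.0
mixed_menu = fssm_choice_probs(["a", "b", "c"], u, u_star)
no_satisficing_menu = fssm_choice_probs(["c", "d"], u, u_star)
```

In this example, choice from the menu containing a, b and c is stochastic only between the two satisficing alternatives, while the menu containing only c and d yields the deterministic, utility-maximizing choice of c.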
Adding the assumption that the probability distribution over search orders is the same for all choice sets makes the satisficing model a subset of the class of random utility models.

Bibliography

ADAMS, A. (2013): “Globally Rational Nonparametric Demand Forecasts.”
AFRIAT, S. N. (1967): “The Construction of Utility Functions from Expenditure Data,” International Economic Review, 8(1), pp. 67–77.
AKAIKE, H. (1998): “Information theory and an extension of the maximum likelihood principle,” in Selected Papers of Hirotugu Akaike, pp. 199–213. Springer.
AKERLOF, G. A. (1970): “The Market for "Lemons": Quality Uncertainty and the Market Mechanism,” The Quarterly Journal of Economics, 84(3), 488–500.
ANDREONI, JAMES, G. B., AND W. HARBAUGH (2013): “The power of revealed preference tests: ex-post evaluation of experimental design,” Working paper.
ANDREONI, J., AND J. MILLER (2002): “Giving according to GARP: An experimental test of the consistency of preferences for altruism,” Econometrica, 70(2), 737–753.
ANDREWS, D. W., AND P. GUGGENBERGER (2009): “Validity of Subsampling and "Plug-in Asymptotic" inference for parameters defined by moment inequalities,” Econometric Theory, 25, 669–709.
APESTEGUIA, J., AND M. A. BALLESTER (2013): “A measure of rationality and welfare,” Working Papers (Universitat Pompeu Fabra. Departamento de Economía y Empresa).
BEATTY, T. K., AND I. A. CRAWFORD (2011): “How demanding is the revealed preference approach to demand?,” The American Economic Review, 101(6), 2782–2795.
BECKER, G. S. (1962): “Irrational behavior and economic theory,” The Journal of Political Economy, 70(1), 1–13.
BLACKWELL, D., ET AL. (1951): “Comparison of experiments,” in Proceedings of the second Berkeley symposium on mathematical statistics and probability, vol. 1, pp. 93–102.
BLOCK, H. D., AND J. MARSCHAK (1960): Contributions to Probability and Statistics, chap. Random Orderings and Stochastic Theories of Response, in (Olkin 1960).
BLUNDELL, R., M. BROWNING, AND I. CRAWFORD (2008): “Best Nonparametric Bounds on Demand Responses,” Econometrica, 76(6), 1227–1262.
BLUNDELL, R., D. KRISTENSEN, AND R. MATZKIN (2014): “Bounding quantile demand functions using revealed preference inequalities,” Journal of Econometrics, 179(2), 112–127.
BLUNDELL, R. W., M. BROWNING, AND I. A. CRAWFORD (2003): “Nonparametric Engel Curves and Revealed Preference,” Econometrica, 71(1), pp. 205–240.
BRONARS, S. G. (1987): “The power of nonparametric tests of preference maximization,” Econometrica: Journal of the Econometric Society, pp. 693–698.
CABRALES, A., O. GOSSNER, AND R. SERRANO (2010): “Entropy and the value of information for investors,” Available at SSRN 1720963.
CAPLIN, A., AND M. DEAN (2011): “Search, choice, and revealed preference,” Theoretical Economics, 6(1), 19–48.
CAPLIN, A., M. DEAN, AND D. MARTIN (2011): “Search and satisficing,” The American Economic Review, pp. 2899–2922.
CAPLIN, A., AND A. SCHOTTER (2010): The foundations of positive and normative economics: a handbook. Oxford University Press.
CHERCHYE, L., I. CRAWFORD, B. DE ROCK, AND F. VERMEULEN (2009): “The revealed preference approach to demand,” Contributions to Economic Analysis, 288, 247–279.
CHERNOZHUKOV, V., H. HONG, AND E. TAMER (2007): “Estimation and confidence regions for parameter sets in econometric models,” Econometrica, 75(5), 1243–1284.
CHOI, S., R. FISMAN, D. GALE, AND S. KARIV (2007): “Consistency and Heterogeneity of Individual Behavior under Uncertainty,” American Economic Review, 97(5), 1921–1938.
CRAWFORD, I., AND B. DE ROCK (2014): “Empirical Revealed Preference,” Annual Review of Economics, 6(1), 503–524.
DE CLIPPEL, G., AND K. ROZEN (2014): “Bounded rationality and limited datasets.”
DEAN, M., AND D. MARTIN (2012): “A Comment on How Demanding Is the Revealed Preference Approach to Demand?,” Discussion paper, Working Paper.
DEAN, M., AND D. MARTIN (2013): “Measuring rationality with the minimum cost of revealed preference violations,” Working paper.
DEKEL, E., AND B. L. LIPMAN (2010): “How (Not) to Do Decision Theory,” Annual Review of Economics, 2(1), 257–282.
DIEWERT, W. E. (2012): “Afriat’s Theorem and some Extensions to Choice under Uncertainty,” The Economic Journal, 122(560), 305–331.
ECHENIQUE, F., S. LEE, AND M. SHUM (2011): “The money pump as a measure of revealed preference violations,” Journal of Political Economy, 119(6), 1201–1223.
ELIAZ, K., AND R. SPIEGLER (2011): “Consideration sets and competitive marketing,” The Review of Economic Studies, 78(1), 235–262.
FALMAGNE, J. C. (1978): “A representation theorem for finite random scale systems,” Journal of Mathematical Psychology, 18(1), 52–72.
FAMULARI, M. (1995): “A household-based, nonparametric test of demand theory,” The Review of Economics and Statistics, pp. 372–382.
FLEISSIG, A. R., AND G. A. WHITNEY (2005): “Testing for the Significance of Violations of Afriat’s Inequalities,” Journal of Business & Economic Statistics, 23(3), 355–362.
GABAIX, X., AND D. LAIBSON (2008): “The seven properties of good models,” The foundations of positive and normative economics. Oxford University Press, Oxford, pp. 292–299.
GEISSER, S., AND W. F. EDDY (1979): “A Predictive Approach to Model Selection,” Journal of the American Statistical Association, 74(365), 153–160.
GIBBARD, A., AND H. R. VARIAN (1978): “Economic models,” The Journal of Philosophy, pp. 664–677.
GUL, F., P. NATENZON, AND W. PESENDORFER (2014): “Random Choice as Behavioral Optimization,” Econometrica, 82(5), 1873–1912.
GUL, F., AND W. PESENDORFER (2006): “Random Expected Utility,” Econometrica, 74(1), 121–146.
GUL, F., W. PESENDORFER, ET AL. (2008): “The case for mindless economics,” The foundations of positive and normative economics, 3, 39.
HALEVY, Y., D. PERSITZ, AND L. ZRILL (2014): “Parametric Recoverability of Preferences,” Discussion paper, Microeconomics.ca Website.
HARBAUGH, W. T., K. KRAUSE, AND T. R. BERRY (2001): “GARP for kids: On the development of rational choice behavior,” American Economic Review, pp. 1539–1545.
HJERTSTRAND, P. (2013): “A Simple Method to Account for Measurement Errors in Revealed Preference Tests,” Working Paper Series 990, Research Institute of Industrial Economics.
HJERTSTRAND, P., AND J. SWOFFORD (2009): “A stochastic test of the number of revealed preference violations,” Working paper.
HJERTSTRAND, P., AND J. SWOFFORD (2014): “Are the choices of people stochastically rational? A stochastic test of the number of revealed preference violations,” Empirical Economics, 46(4), 1495–1519.
HODERLEIN, S., AND J. STOYE (2009): “Revealed Preferences in a Heterogeneous Population,” Review of Economics and Statistics.
HOETING, J. A., D. MADIGAN, A. E. RAFTERY, AND C. T. VOLINSKY (1999): “Bayesian Model Averaging: A Tutorial,” Statistical Science, 14(4), pp. 382–401.
HOUTHAKKER, H. S. (1950): “Revealed Preference and the Utility Function,” Economica, 17(66), pp. 159–174.
HOUTMAN, M., AND J. MAKS (1985): “Determining all maximal data subsets consistent with revealed preference,” Kwantitatieve methoden, 19, 89–104.
KASS, R. E., AND A. E. RAFTERY (1995): “Bayes factors,” Journal of the American Statistical Association, 90(430), 773–795.
KOCOSKA, L. (2012): “A non-parametric approach to the estimations of critical inputs to economic models based on consumption data,” Ph.D. thesis.
LEWBEL, A. (2001): “Demand Systems with and without Errors,” American Economic Review, pp. 611–618.
LUCE, R., AND P. SUPPES (1965): Handbook of Mathematical Psychology, Vol. III, chap. Preference, Utility, and Subjective Probability, in (R. D. Luce and Galanter 1965).
MÄKI, U. (2005): “Models are experiments, experiments are models,” Journal of Economic Methodology, 12(2), 303–315.
MANZINI, P., AND M. MARIOTTI (2014): “Stochastic Choice and Consideration Sets,” Econometrica, 82(3), 1153–1176.
MAS-COLELL, A. (1978): “On revealed preference analysis,” The Review of Economic Studies, pp. 121–131.
MASATLIOGLU, Y., D. NAKAJIMA, AND E. Y. OZBAY (2012): “Revealed attention,” The American Economic Review, 102(5), 2183–2205.
MCCLELLON, M. (2015): “Unique Random Utility Representations.”
MOLCHANOV, I. (2005): Theory of Random Sets. Springer, London.
OLKIN, I. (ed.) (1960): Contributions to Probability and Statistics. Stanford University Press, Stanford.
R. D. LUCE, R. R. B., AND E. GALANTER (eds.) (1965): Handbook of Mathematical Psychology, Vol. III. Wiley, New York.
REUTSKAJA, E., R. NAGEL, C. F. CAMERER, AND A. RANGEL (2011): “Search dynamics in consumer choice under time pressure: An eye-tracking study,” The American Economic Review, pp. 900–926.
RICHTER, M. K. (1996): “Revealed preference theory,” Econometrica: Journal of the Econometric Society, pp. 635–645.
RUBINSTEIN, A., AND Y. SALANT (2006): “A model of choice from lists,” Theoretical Economics, 1(1), 3–17.
SALANT, Y., AND A. RUBINSTEIN (2008): “(A, f): Choice with Frames,” The Review of Economic Studies, 75(4), 1287–1296.
SAMUELSON, P. A. (1938): “A Note on the Pure Theory of Consumer’s Behaviour,” Economica, 5(17), pp. 61–71.
SAMUELSON, P. A. (1948): “Consumption Theory in Terms of Revealed Preference,” Economica, 15(60), pp. 243–253.
SANTOS, B. D. L., A. HORTACSU, AND M. R. WILDENBEEST (2012): “Testing Models of Consumer Search Using Data on Web Browsing and Purchasing Behavior,” American Economic Review, 102(6), 2955–80.
SCHWARTZ, B., A. WARD, J. MONTEROSSO, S. LYUBOMIRSKY, K. WHITE, AND D. R. LEHMAN (2002): “Maximizing versus satisficing: happiness is a matter of choice,” Journal of Personality and Social Psychology, 83(5), 1178.
SELTEN, R. (1991): “Properties of a measure of predictive success,” Mathematical Social Sciences, 21(2), 153–167.
SIMON, H. A. (1955): “A behavioral model of rational choice,” The Quarterly Journal of Economics, pp. 99–118.
SIPPEL, R. (1997): “An experiment on the pure theory of consumer’s behaviour,” The Economic Journal, 107(444), 1431–1444.
SPENCE, M. (1973): “Job market signaling,” The Quarterly Journal of Economics, pp. 355–374.
SUGDEN, R. (2000): “Credible worlds: the status of theoretical models in economics,” Journal of Economic Methodology, 7(1), 1–31.
SUGDEN, R. (2009): “Credible worlds, capacities and mechanisms,” Erkenntnis, 70(1), 3–27.
VARIAN, H. R. (1982): “The nonparametric approach to demand analysis,” Econometrica: Journal of the Econometric Society, pp. 945–973.
VARIAN, H. R. (1985): “Non-parametric analysis of optimizing behavior with measurement error,” Journal of Econometrics, 30(1), 445–458.
VARIAN, H. R. (1990): “Goodness-of-fit in optimizing models,” Journal of Econometrics, 46(1), 125–140.
VARIAN, H. R. (2006): “Revealed preference,” Samuelsonian economics and the twenty-first century, pp. 99–115.
WARD, D. (1992): “The role of satisficing in foraging theory,” Oikos, pp. 312–317.