Studying the Causes and Consequences of Population Mobility Using Mobile Phone Data

Svetoslava Milusheva
B.A., Emory University, 2010
M.A., Brown University, 2013

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Economics at Brown University

Providence, Rhode Island
May 2017

© Copyright 2017 by Svetoslava Milusheva

This dissertation by Svetoslava Milusheva is accepted in its present form by the Department of Economics as satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date    Andrew Foster, Advisor

Recommended to the Graduate Council

Date    Jesse Shapiro, Reader
Date    Daniel Björkegren, Reader

Approved by the Graduate Council

Date    Andrew G. Campbell, Dean of the Graduate School

Curriculum Vitae

The author is originally from Plovdiv, Bulgaria. She began her studies at Emory University in Atlanta, Georgia in 2006, where she majored in Economics and International Studies and minored in Spanish and Russian. She was an Emory Scholar, receiving a full fellowship for her studies, and entered Phi Beta Kappa in 2009. She graduated summa cum laude from Emory with a B.A. in International Studies and Economics in 2010. After graduating, she worked as a research assistant at the Brookings Institution for two years before matriculating in Brown University's doctoral program in economics. At Brown, she received a National Science Foundation IGERT Traineeship, a National Institute of Child Health and Human Development T32 Fellowship, a Social Science Research Council Dissertation Proposal Development Fellowship and an Interdisciplinary Opportunity Award. She received her Master's degree in Economics from Brown University in 2013 and her Ph.D. in Economics from Brown in 2017.

Acknowledgements

First and foremost, I would like to thank my dissertation committee, Andrew Foster, Jesse Shapiro and Daniel Björkegren.
I am especially grateful to Andrew, who has been one of my strongest supporters and advocates through the Ph.D. program. His commitment as an advisor and the countless hours he spent helping me work through tough problems made the biggest difference in my path at Brown. I am very grateful to Jesse, whose encouragement at the beginning stages of my cell phone data work pushed me to continue despite its non-standard nature, and whose guidance helped me to strengthen and expand my research. I am indebted to Dan, who is the reason that I followed this direction in my research, which helped me discover my passion for working with large, complex datasets. He has also supported me through the entire process, providing me with valuable comments and feedback every step of the way.

I would like to thank Emily Oster for her incredible work as the Job Placement Coordinator and Angelica Vargas for her help and support with everything that we needed from before we even entered Brown and through the entire journey. I would also like to thank Bryce Steinberg, Anja Sautmann, and participants at the World Bank ABCDE Conference, the NEUDC Conference, the PopPov Conference, the Population Health Sciences Workshop, the PAA Conference, the Brown microeconomics seminar and the Brown development economics working group for very helpful feedback and comments. I am also extremely appreciative of Andy Tatem, Victor Alegana, Nick Ruktanonchai, Elisabeth zu Erbach, Jessica Steele, and Cori Ruktanonchai for helping me understand this research area so much better and taking the time to meet with me, discuss and provide invaluable feedback.

I am also grateful to the advisors and mentors prior to Brown who are the reason I am here today. Andrew Francis was the first professor to show me that Economics meant more than supply and demand and made me passionate about using the tools from economics to answer the many interesting questions that come up all around us.
Mary Odem introduced me to research, and acting as her research assistant sophomore year of college made me realize my interest in studying migration and population mobility. Maura Belliveau was integral in pushing me as a researcher and encouraging me to pursue a Ph.D. I learned more working as her research assistant than in most of my undergraduate classes combined, and her creativity and innovation are something I aspire to.

Thank you to the Bill & Melinda Gates Foundation (grant OPP1114791) and the NICHHD (grant T32 HD 7338-29) for financial support. Anonymous mobile phone data have been made available by Orange and Sonatel within the framework of the D4D-Senegal challenge. I am especially grateful to Nicolas de Cordes, Stephanie de Prèvoisin and Coumba Sangare for their tireless work in organizing the Challenge and helping the researchers succeed. I also want to thank Cezary Ziemlicki at Orange for responding to my numerous emails and working so hard to make it possible for me to access the data, and the other five Data for Development teams that I had the opportunity to work with in Senegal and from whom I've learned so much. I also thank PATH and the National Malaria Control Program for providing the data on malaria incidence, and especially Philippe Guinot, Yakou Dieye, and Mady Ba for taking the time to work with me closely, and teaching me about malaria in Senegal and the amazing work they have been doing to help decrease incidence and improve the lives of so many people.

I would not have had the opportunity to write this dissertation without my first year study group: Adrian Rubli, Mike Bedard, Jamie Hansen-Lewis, Will Violette, Jean-Yves Gerardy, and Simon Freyaldenhoven. The generosity they showed with their time in their willingness to spend hours teaching me and helping me understand all of the first year material is something that I will always be grateful for, and I could not have made it through that first year without them.
I am also thankful for their friendship and help throughout the program. I appreciate the constant support from my office mates David Glick and Girija Borker, and feel so lucky that I was able to share the majority of my experience here with my wonderful roommate Pavitra Govindan. I am also thankful for my many friends here in Providence who have shaped my experience here. I am especially grateful to Morgan Hardy, Angelica Meinhofer, Daniela Scida, Cait Ramsay, Molly McCloskey, and Hilary Nicholson, for always being there for me and providing me with support, laughs, baked goods, and anything I might need to overcome challenges and be successful. I feel so lucky to have this amazing support network here in Providence, as well as my larger support network in DC, Boston, New York, Madison, Baltimore, Pittsburgh and around the world that has always only been a phone call away and there for me no matter what.

I am very grateful to Denis Murray, who has been through my ups and downs in the program and has stood by me and supported me and my goals in every way possible. I also have to acknowledge Dizzy, who has sat by my side through the writing of most of this dissertation, providing me with comfort and smiles.

Finally, the people who have shaped me the most and have been the biggest cheerleaders are my family. I am so lucky to have my brilliant sister Plamena, who has been a role model throughout my life. My parents' love and encouragement is something that I am always thankful for, and I would not be here without the ambition, sense of humor, intellectual curiosity, and hard work ethic that they instilled in me. Thank you all for the unconditional support.

Contents

Introduction
1 Less Bite for Your Buck: Using Mobile Phone Data to Target Disease Prevention
  1.1 Introduction
  1.2 Background and Data
    1.2.1 Malaria Characteristics
    1.2.2 Health System and Malaria in Senegal
    1.2.3 Population Movement in Senegal
    1.2.4 Malaria Data
    1.2.5 Population Movement Data
  1.3 Model of Malaria
    1.3.1 Theoretical Model
    1.3.2 Empirical Model
    1.3.3 Identification
  1.4 Results
    1.4.1 Quantifying the Effect of Imported Cases
    1.4.2 Mechanism Evidence
  1.5 Policy Targeting Strategies
    1.5.1 Malaria Policies Toward Travelers
    1.5.2 Targeting Simulations
  1.6 Robustness Checks
  1.7 Conclusion
  1.A Additional Tables and Figures
  1.B Primary and Secondary Case Detection in Richard Toll
  1.C Validation Exercise
  1.D Additional Robustness Checks
2 Predicting Dynamic Patterns of Short Term Movement and Implications for Studying Malaria Transmission
  2.1 Introduction
  2.2 Data
  2.3 Empirical Analysis and Results
  2.4 Discussion and Conclusion
3 Bias in Tracking Movement Using Mobile Phone Data
  3.1 Introduction
  3.2 Data
  3.3 Model
  3.4 Analysis
    3.4.1 Testing the Independence of the Pings
    3.4.2 Results for Days with Calls
    3.4.3 Results After Interpolation of Missing Days
  3.5 Policy Applications
  3.6 Robustness Checks
  3.7 Conclusion
  3.A Additional Tables and Figures
Bibliography

List of Tables

1.1 Effect of Imported Malaria Incidence
1.2 Placebo Tests
1.3 Robustness Checks
1.B.1 Primary and Secondary Cases in Richard Toll District
1.D.1 Robustness Checks
2.1 Standard Deviation Change in People Entering for a Standard Deviation Change in Each Variable
3.1 Summary Statistics for Individuals Receiving and Not Receiving Pings
3.2 Checking the Random Pings for Independence
3.3 Results of the Analysis Broken Down by Percent of Time Present in the Data
3.4 Theoretically Fixed Results
3.5 Correcting how Missing Days Are Assigned
3.6 Removing Individuals in the Data with Over 15,000 Calls
3.7 Checking the Random Pings for Independence Based on New Definition
3.8 Results of the Analysis for Different Definition of Pings
3.A.1 Results of the Analysis Broken Down by Number of Days Present in the Data, Call and Ping Days Looked at Separately

List of Figures

1.1 Average Monthly Healthpost Catchment Area Malaria Incidence per 1000 and Average Monthly Rainfall, Jan 2013-Dec 2015 by Health District
1.2 People Entering a Health Post Catchment Area as a Percent of the Population in the Area
1.3 Predicted with and without Imported Cases and Actual Incidence Averaged Across Health post Areas by District
1.4 Estimated Impact of Future, Current and Past Expected Imported Malaria Incidence
1.5 Effect on Malaria Cases of a Policy Targeting Migrant Workers at the Senegalese Sugar Company
1.6 Cost and Benefit Under Different Strategies
1.A.1 Annual Malaria Incidence in Senegal in 2013
1.A.2 Senegal Health Districts and Location of Health Posts in the North Used in the Analysis
1.A.3 Estimated Impact of Future Expected Imported Malaria Incidence and Past Total Incidence
1.A.4 Testing the Approximation of the Entomological Inoculation Rate in Assumption 4
2.1 Actual versus Predicted Short Term Movement
2.2 Actual and Predicted Average Number of People Entering a District per Month in 2013, by Region in '000s
2.3 Actual versus Predicted Expected Imported Incidence of Malaria
2.4 Coefficient on Expected Imported Malaria Cases
3.1 Comparison of Distribution of All Day Locations Based on Calls and Pings
3.2 Annual Comparison of Population Numbers
3.3 Daily Comparison of Population Numbers
3.4 Monthly Comparison of Malaria Incidence
3.A.1 Comparison of Distribution of All Day Locations Based on Calls and Pings Excluding Weekends

Introduction

There can be high mobility among populations, with many people choosing to migrate long term or making visits to spend time with family and friends or for economic reasons. There are a number of different factors that can potentially drive this population movement, and there are also many consequences of this movement. Population movement can lead to the spread of infectious disease, spread of information, or spread of resources. It can have repercussions for economic and labor markets. It can also have important social consequences as families might live apart. Yet understanding the effects of population movement is extremely difficult because in the majority of cases only survey data are available. Such surveys might provide data on long term migration patterns, or potentially more detailed data on movement but only for a small group of people, since collecting this type of data is extremely time and cost intensive. To understand the impact of movement on something like infectious disease, though, it is necessary to look at all movement, not just long term migration patterns, and it is important to study this at a larger scale. In my dissertation I show how mobile phone data can be used in order to study short term population movement and its consequences.
In all three chapters of my dissertation I use mobile phone data provided by Sonatel, the largest mobile phone provider in Senegal, for around 9.5 million unique SIM cards in 2013. I use anonymous IDs to track the location of calls and texts made or received by SIMs over time in order to infer population movement. This type of new data allows us to track population movement at a very detailed spatial and temporal scale for an entire country. I show how the data can be applied to study one particular important consequence of population movement. I also show how even in the absence of this type of data, it might be possible to use other data sources to make predictions of what short term movement might look like. And finally, I explore potential bias in the use of this data arising from non-random error in the measurement of movement.

In Chapter 1, I study the negative externality of human movement through its contribution to the spread of diseases in areas of low incidence. Using the 15 billion mobile phone records for 2013, I estimate daily interregional movement and demonstrate substantial intertemporal and geographic variation in movement. I link this variation to clinic data on the incidence of malaria in order to calculate the probability a traveler is infected and to determine the impact in the area a traveler enters. Estimates indicate that an expected infected traveler entering a health facility's area causes reporting of 1.6 additional cases. Applying the results to policy simulations, I find that at the same cost, strategic targeting of travelers would result in over twice as many cases averted as compared to the next best strategy. These findings indicate how novel applications of big data, combined with traditional health measures, can enable improvements in policy to address negative spillovers from travel and lower the burden from communicable disease.
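As a rough illustration of how call records can be turned into daily movement, the following Python sketch assigns each anonymous SIM a daily location (the region where it logged the most calls or texts that day) and counts transitions between consecutive observed days. The record layout, the most-frequent-region rule, and the function names are assumptions made for illustration; they are not taken from the procedure actually used in the dissertation.

```python
from collections import Counter, defaultdict

def daily_locations(records):
    """Map each (sim_id, day) to the region with the most calls/texts that day.

    `records` is an iterable of (sim_id, day, region) tuples; this layout is
    a hypothetical simplification of real call detail records.
    """
    counts = defaultdict(Counter)
    for sim_id, day, region in records:
        counts[(sim_id, day)][region] += 1
    # Ties are broken alphabetically so the result is deterministic.
    return {
        key: min(c, key=lambda r: (-c[r], r))
        for key, c in counts.items()
    }

def movement_counts(locations):
    """Count moves between regions across consecutive observed days per SIM."""
    by_sim = defaultdict(list)
    for (sim_id, day), region in locations.items():
        by_sim[sim_id].append((day, region))
    moves = Counter()
    for visits in by_sim.values():
        visits.sort()
        for (_, origin), (_, dest) in zip(visits, visits[1:]):
            if origin != dest:
                moves[(origin, dest)] += 1
    return moves

# Toy records: sim_a is mostly in Dakar on day 1, then seen in Saint-Louis.
records = [
    ("sim_a", 1, "Dakar"), ("sim_a", 1, "Dakar"), ("sim_a", 1, "Thies"),
    ("sim_a", 2, "Saint-Louis"),
    ("sim_b", 1, "Thies"), ("sim_b", 2, "Thies"),
]
locations = daily_locations(records)
moves = movement_counts(locations)
print(moves)
```

In this sketch a SIM has a location only on days when it makes or receives a call or text, so days without activity are simply skipped.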
Yet this type of mobile phone data is not yet easily available in all country contexts, and even in the countries where it is made available to researchers or policy makers, so far only a piece of the data (a year or several months) is made available. So as policy makers try to craft policies incorporating information on population movement, it becomes necessary to see whether it is possible to use the data that is available in order to make predictions of the short term movement that should be expected. In Chapter 2, I put together data that is available for the majority of low income countries, and use Senegal as a case study to predict short term movement within the country on a monthly basis. I focus on two main drivers of movement, economic and social, with which I explain almost 70% of the variation in short term movement. I then study the impact of population movement on the spread of malaria in Senegal using the actual movement calculated from the mobile phone data and using the predicted short term movement. The predictions generated by my model can provide good estimates for the effect of short term movement on malaria.

In the first two chapters of this dissertation, I assume that any error in the measurement of population movement is random and should therefore not impact the average results that I obtain. Yet, people's propensity to make calls or send text messages is a function of their environment and circumstances, and I can only see a SIM in a location if the person has made or received a call or text message. As new research using mobile phone data grows and begins to influence policy, it is critical to understand potential non-random error in the measurement of movement that could bias movement estimates and ultimately the final results of an analysis.
In Chapter 3, I use the calls and text messages received from accounts that make calls en masse, which are exogenous in timing to population movement, in order to study whether non-random measurement error exists in the likelihood of detecting an individual in a particular location. I also present different ways of addressing the bias that I find, which could be employed by researchers and policy makers. Finally, I study the specific example of measuring population size and the impact of population movement on malaria transmission in order to see whether the non-random measurement error might lead to a bias in analyses being conducted using this data.

This dissertation aims to provide a holistic view of the use of mobile phone data in economics research focused on population mobility. While mobile phone data has been used primarily in other disciplines such as computer science, I want to demonstrate its potential for use in economics and how it can help with answering important policy-relevant questions that we have not been able to study before due to a dearth of data concerning the consequences of population mobility. I also show how even in the absence of this data, we can learn lessons from the data that is available that can allow us to make better predictions about short term population mobility. And finally, while this new source of data is exciting and has incredible potential for allowing us to look at new questions, it is also important to remember that the data was not collected for this purpose, and it is important to be cautious and understand potential bias that might arise from using the data to calculate things like population mobility. This dissertation is a preliminary look at this topic, and an area that I plan to pursue and expand further by employing mobile phone data from additional settings as well as testing the policy applicability of this data.
Chapter 1
Less Bite for Your Buck: Using Mobile Phone Data to Target Disease Prevention

1.1 Introduction

The Global Fund has disbursed nearly $28.4 billion in the last decade to reduce the disease burden from malaria, TB and HIV (The Global Fund 2016).[1] However, travelers can reverse the progress from campaigns that have decreased infectious disease prevalence (Justin Cohen et al. 2012, Lu et al. 2014), or can rapidly spread emerging diseases such as Ebola and Zika (Tam, Khan, and Legido-Quigley 2016, Bogoch et al. 2016). Spillovers can arise when infected individuals from regions with a higher disease burden enter areas where a disease has been eliminated or reduced. For example, Venezuela, the first country certified by the World Health Organization (WHO) for eliminating malaria, has seen a large resurgence of the disease in the last year due to migrant workers in the mining region who have become sick, traveled home, and spread the disease to their home villages and cities (Casey 2016). Given that migration has numerous economic and social benefits, policymakers face important trade-offs in designing policies to reduce travel-linked malaria cases. This chapter provides a useful framework for identifying high-risk populations in order to reduce malaria incidence with minimal interference to movement patterns.

[1] In addition to reducing mortality and morbidity, elimination of infectious diseases has been shown to have important economic consequences (Bleakley 2007, Bleakley 2010).

The main challenge in studying the impact of travel on disease and how to mitigate it is the difficulty of tracking population movement and infections. Both disease transmission and movement are highly variable over space and time; therefore, detailed data are necessary to account for their seasonality and spatial distribution.
For example, if a disease is only present during the rainy season, annual data on population movement and disease could provide a distorted estimate of the relationship between the two. Additionally, even if the measurement challenge is overcome and the relationship between travel and disease is quantified, it is unclear how a government should determine who to target to prevent the spread of diseases.

This chapter overcomes this challenge by utilizing a new source of data, mobile phone records, to track population movement for a large number of people at a very high spatial and temporal resolution. I combine the mobility data with comprehensive disease incidence data. I employ an empirical model of the relationship between population movement and disease to measure the effect of a single expected infected traveler to an area on the area's disease burden. The empirical model suggests a cost-effective strategy to target travelers with preventative interventions. I find that for a government budget under $100,000 dedicated to this program, switching to the cost-effective strategy can avert on average 2.2 times as many cases as compared to the next best strategy.

To quantify the impact of travel on malaria incidence, the chapter builds on an existing epidemiological model of malaria propagation to incorporate population movement. I use mobile phone data for 9.5 million SIM cards in Senegal in 2013 to extract patterns of movement between different areas from the approximate location of 15 billion calls and texts. From the origins, time and length of travel of incoming travelers to an area, I compute the expected imported incidence. I use a panel data strategy, estimating the impact of imported incidence on total malaria incidence controlling for location and time fixed effects. I examine the impact of expected imported incidence in areas that are close to eliminating malaria.
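To make the idea of expected imported incidence concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each group of incoming travelers contributes its size times the origin's monthly malaria incidence rate, scaled by the share of the month spent at the origin. The chapter's own model (Section 1.3) may use a different formula, and all names and numbers below are hypothetical.

```python
def expected_imported_cases(traveler_groups, origin_incidence, days_in_month=30):
    """Expected number of infected arrivals to one destination in one month.

    traveler_groups: list of (origin, n_travelers, days_at_origin) tuples.
    origin_incidence: dict of origin -> monthly cases per person.
    Assumption (not from the source): infection risk scales linearly with
    the share of the month a traveler spent at the origin.
    """
    total = 0.0
    for origin, n_travelers, days_at_origin in traveler_groups:
        exposure_share = min(days_at_origin, days_in_month) / days_in_month
        total += n_travelers * origin_incidence[origin] * exposure_share
    return total

def per_thousand(cases, population):
    """Convert a case count to an incidence per 1,000 residents."""
    return 1000.0 * cases / population

# Hypothetical inputs: two origins sending travelers to one low-malaria area.
groups = [("high_area", 200, 30), ("medium_area", 100, 15)]
incidence = {"high_area": 0.02, "medium_area": 0.004}

imported = expected_imported_cases(groups, incidence)
rate = per_thousand(imported, population=10_000)
print(imported, rate)
```

In a panel setting, a per-thousand rate of this kind, computed for each area and month, would serve as the explanatory variable in a regression of total malaria incidence that controls for location and time fixed effects.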
If infected travelers only represented a change in location of a case, but did not generate any externality in the form of additional infections, then for each expected imported case there should be just one more additional case reported in the area. Instead, I find that one additional expected imported case of malaria per thousand people living in a low malaria area leads to 1.6 cases of malaria reported per thousand people.

There are several potential threats to identification that I address in the chapter. I find that population movement over time within an area is largely driven by agricultural seasons and holidays, so I consider confounders that might be correlated with these factors and with malaria. I control for the agricultural seasons in my analysis using rainfall. I use a placebo test to ensure the results are not being driven by holidays increasing movement and also causing malaria to increase through another channel. Additional placebo tests are used to check that the relationship between imported incidence and malaria incidence is not being driven by some other relationship between origins and destinations as well as to ensure that the relationship only holds for malaria and not other health conditions. I also test whether future imported incidence impacts malaria incidence in the current month and find that it does not. I argue that this evidence is suggestive of a causal relationship.

This chapter demonstrates how detailed movement data can be used to craft a targeted intervention that addresses negative externalities associated with travel. I first examine four strategies for addressing the impact of travelers. The simplest is a random untargeted strategy throughout the year, the second is a random strategy during the malaria season, the third targets districts based on malaria incidence in the prior year, and the final targets travelers in certain months using my estimated mobility and incidence model.
I run simulations based on the estimated model and construct four cost curves for the potential targeting strategies. The cost-effective strategy based on my model leads to a decrease in cases deriving from travelers that ranges from being 25 percent to over nineteen times larger depending on the alternative strategy used as a comparison. I also examine a strategy that targets not only particular travelers but also particular areas, which on average is 2.2 times as effective as the next best policy.

This chapter relates closely to work by Tatem, Qiu, et al. (2009) and Le Menach et al. (2011) that uses movement patterns from three months of cell phone data and endemicity data to estimate the malaria importation rate to Zanzibar as 1.6 cases per 1000 residents per year. I build on this work by developing an epidemiological model that I empirically estimate using incidence data[2] and a full year of mobility data, which allows me to capture the seasonality and spatial variation in movement and incidence. Annually, I find an importation rate of only 0.4 cases per 1000 residents. When this is disaggregated monthly, the rate during the highest malaria month is 44 times the rate in the lowest month. This chapter also contributes to the literature that establishes travel as an important risk factor for contracting malaria (Montalvo and Reynal-Querol 2007, Lynch et al. 2015, Osorio, Todd, and Bradley 2004, Siri et al. 2010, Littrell et al. 2013) by demonstrating the negative externality of travel in the form of secondary cases. Furthermore, prior work using mobile phone data to study population movement and malaria has categorized the transmission risk for a set of origins and destinations, but has stopped short of quantifying the number of new cases due to travel patterns (Wesolowski, Eagle, et al. 2012, Enns and Amuasi 2013).

[2] The endemicity data are gathered from parasite rate surveys in which a random subsample of the population is tested for malaria parasites. When the malaria prevalence is very low, the likelihood of having a positive case becomes very small. In this setting, incidence can be a more reliable measure (Alegana et al. 2013, J. M. Cohen et al. 2013).

A plethora of research has shown that as infected individuals travel, they can carry a disease with them and transmit it in new areas, leading to negative spillovers from areas where the disease is prevalent (Oster 2012, Prothero 1977, Balcan et al. 2009, Huang, Tatem, et al. 2013, Tatem and D. L. Smith 2010, Wesolowski, Metcalf, et al. 2015, Wesolowski, Qureshi, et al. 2015, Stuckler et al. 2011, Pison et al. 1993). Yet, more work is necessary to understand how to manage these spillovers. Given the billions of dollars spent on reducing infectious diseases, it is critical that the money is spent effectively in order to prevent reversing the progress made. While much of the previous work has focused on what type of policy should be implemented (Adda 2016, Lima et al. 2015, Epstein et al. 2007) or the price of particular malaria interventions (Jessica Cohen and Dupas 2010, Jessica Cohen, Dupas, and Schaner 2015), here I focus on who should be targeted, irrespective of the policy utilized. This chapter uses the estimated model for a quantitative policy design exercise that evaluates the cost-effectiveness of targeting specific groups of travelers with policies.

While this chapter focuses on malaria, it also has implications for other diseases, since the travel patterns studied using the cell phone data could lead to the transmission of any communicable disease. If these data are obtained for other countries or for different diseases, it is possible to replicate the analysis using the methodology developed in this chapter. I demonstrate how new sources of big data can be used to study externalities associated with population movement and can help to inform policies to fight infectious disease.
The chapter begins by providing some background and describing the data. It then goes on to model the link between malaria and population movement in section 3. Section 4 outlines the empirical results linking travel to malaria and section 5 examines the cost eectiveness of four dierent policies. Some robustness checks are 9 provided in section 6, and the chapter concludes with section 7. 1.2 Background and Data 1.2.1 Malaria Characteristics Malaria is an infectious disease that requires two hostshumans and mosquitoesin order to spread. The malarial cycle for P. falciparum, the parasite causing 100 percent of cases in Senegal, can take several weeks (World Malaria Report 2014). After an infected individual is bit by a mosquito, there is an incubation period lasting around 9 days within the mosquito (Killeen, A. Ross, and T. Smith 2006).3 If the mosquito survives the incubation period, it can bite and infect a healthy individual, after which there is a second incubation period within the human of around 15 days (D. L. Smith and McKenzie 2004, Hoshen and Morse 2004). Symptoms will appear at the end of this period and the individual will become infectious.4 Combining the two incubation periods, a secondary case will take around one month to appear after a primary case.5 There are two channels through which population movement can spread malaria from high malaria to low malaria areas. The rst is residents of low malaria areas who travel to high malaria areas and become infected when bit by infected mosquitoes. Since malaria symptoms do not appear for around two weeks, the resident can travel home feeling healthy. Once at the home location, the person can become symptomatic, as well as infect mosquitoes. These infected mosquitoes can infect other individuals and pass on the disease. The second channel is visitors or migrants that live in a high malaria area and travel to a low malaria area. 
Again, at the beginning of their travel, these individuals might not exhibit symptoms, but can still be carriers of the disease.[6] Therefore, if they are bitten by a mosquito in the low malaria area, they could infect that mosquito and it could in turn infect other individuals.[7]

[3] The incubation period can vary, but the value of 9 days was found to be the average for two different sites in Senegal.
[4] Unlike some of the other malarial parasites, P. falciparum does not have the potential to lie dormant for months or years.
[5] Details on malaria transmission can be found in Doolan, Dobaño, and Baird (2009), D. L. Smith and McKenzie (2004), Killeen, A. Ross, and T. Smith (2006), and Wiser (2010).
[6] There are three possibilities in the case of visitors. 1. The person was recently infected and will become symptomatic at the end of the incubation period. 2. In areas where the disease is highly prevalent, acquired immunity can develop. The person with such immunity still has a small amount of the parasite, but does not exhibit symptoms. In some cases, though, they can develop symptoms after arriving in the low malaria area and can then infect mosquitoes. Anecdotally, people often travel for seasonal agricultural work, and the physical stress of the work weakens the immune system, causing the parasite that was within the immune person to develop faster and leading to symptoms and a higher likelihood of infecting mosquitoes. 3. The person has acquired immunity and never becomes symptomatic. These people can still infect mosquitoes and lead to new cases. Since the data used in the chapter measure only symptomatic individuals that are tested by a health worker, I do not capture these asymptomatic individuals. Nevertheless, research finds that while these individuals can spread malaria, the rate is two to sixteen times lower compared to symptomatic individuals (Okell et al. 2012).
[7] Since the average radius of travel for the mosquitoes that carry the malaria parasite in Senegal is around 1-2 km, mosquito movement is not considered because the analysis is carried out on a larger geographic scale (Russell and Santiago 1934, Thomas, Cross, and Bøgh 2013).

1.2.2 Health System and Malaria in Senegal

Senegal is geographically divided into 14 health regions, under which there are 76 health districts. The main point of service for malaria cases is the health post. There are a total of 1,247 health posts in the country (PNLP, INFORM, LSHTM 2015). In addition, there are rural health huts and community health workers that provide care for those living far from a health post, and report the cases to the closest health post. Since the establishment of the National Malaria Control Program (PNLP) in 1995, the program has coordinated a variety of measures and policies that have led to a reduction in deaths attributed to malaria from 12.93 per 100,000 people in 2000 to 8.26 in 2013 (PNLP, INFORM, LSHTM 2015).

Currently, the north of the country has very low incidence and is at the level considered ready for elimination by the World Health Organization (1 case per 1000, also known as the pre-elimination phase). In contrast, the south still has a high case load, with some districts as high as 270 cases per 1000.[8] The heterogeneity can be partly attributed to environmental factors because the rainy season is twice as long in the south as in the north, which allows for mosquitoes to breed and spread the disease for longer, but the mosquitoes required to spread the disease are present in the low malaria areas (Ndiath et al. 2012). Given the two distinct zones in the country, the government of Senegal strives to continue reducing the case load in the South, while aiming to eliminate malaria completely from the North. As potentially infected individuals travel from the South to the North, though, they can hinder elimination efforts in the North.
[8] Appendix 1.A contains a map of annual malaria incidence by district.

1.2.3 Population Movement in Senegal

Senegal has large flows of long term and permanent migration, with 27% of the population recorded as an internal migrant in 2004 (P. D. Fall, Carretero, and M. Y. Sarr 2010).[9] A large part of this migration is rural to urban due to irregularity of rainfall and degradation of the ecosystems that have impacted agricultural activity (P. D. Fall, Carretero, and M. Y. Sarr 2010, Goldsmith, Gunjal, and Ndarishikanye 2004). In turn, this longer term migration can lead to commuting patterns as people return home to visit family and friends or receive visitors from home (Cho, Myers, and Leskovec 2011). Focusing on migrants in Dakar, A. S. Fall (1998) finds that 87% of male and 81% of female migrants visited their home areas, with the majority of visits occurring for holidays, family ceremonies and religious festivals. Detailed studies of the Jola ethnic group in several villages find that circular migration plays an important role, with over 80% of unmarried Jola youth traveling to the cities in October and then coming back before the rice harvest in June-July (Linares 2003). Broader research on youth in Senegal has shown that more than half of the internal migration they engage in is temporary and rural to rural or urban to urban (Herrera and Sahn 2013). Additionally, there are still pastoral groups that travel within a set territory, but may also send paid herders with their livestock over longer distances in order to maintain their herds (Adriansen 2008).

[9] This is comparable to the rest of sub-Saharan Africa, where 50-80 percent of rural households were estimated to have at least one migrant (Deshingkar and Grimm 2005).

Understanding the movement patterns within Senegal is important for thinking through potential confounding factors between movement and malaria. The majority of the literature points to movement triggered by agricultural seasons as well as holidays.
These factors and their relationship to malaria incidence will be discussed in the model section.

1.2.4 Malaria Data

The data used to measure malaria incidence come from the PNLP and MACEPA, a non-profit organization working with the PNLP to fight malaria in Senegal. They collect data on all new cases in the reporting month. If an individual feels sick, usually experiencing a fever, chills and fatigue, she will go to the closest health post where she will be tested using a rapid diagnostic test (RDT) due to her symptoms. If she tests positive, she will be provided with medication for free to treat the disease. Case data are available monthly by health district for the whole country. For a few Northern districts, data are available monthly at the health post level. I use health post data for five very low malaria districts, covering 117 health posts, where the PNLP wants to focus on targeting travelers. Appendix 1.A provides a map of the five health districts, which I subdivide into areas based on the location of the health posts and cell phone towers. Health posts in close proximity were grouped together, forming 36 health post catchment areas.

Monthly malaria incidence per 1000 people is averaged across health post catchment areas within districts for three years in Figure 1.1. Districts on average have around 0.1 cases per 1000 people per month. The figure also overlays the monthly cumulative rainfall in centimeters averaged across health post catchment areas.[10] The comparison of cases and rainfall demonstrates strong seasonality of malaria in Senegal and the close relationship between rainfall and malaria, with the peak of cases annually occurring one to two months after the peak in rainfall. The relationship between malaria and rainfall is explicitly modeled in the analysis since rainfall is an important explanatory variable of malaria that could also affect population movement.

[10] Rainfall data are from the Climate Prediction Center (2016) Rainfall Estimator for Africa.
There are three main challenges that arise with using clinical data: incomplete data reporting, presumptive diagnosis based on symptoms rather than testing, and non-utilization of the public health system (Alegana et al. 2013). In the data, only 12 out of 1416 health post-month observations are missing in 2013. In addition, 99% of suspected cases were tested parasitologically in the five districts analyzed. Since both malaria cases and imported cases are calculated based on case data, as long as utilization is relatively uniform across the country, the effect measured should be accurate. Based on the DHS data for all the regions, a health facility was visited for fever in children under age 5 in 46% of cases (ANSD and ICF International 2015). The standard deviation of this utilization across regions is 6.5 percentage points. While in the main analysis I assume uniform utilization, I include a robustness check where cases are scaled by regional utilization in the DHS.

1.2.5 Population Movement Data

The data used to measure short term movement come from phone records made available by Sonatel and Orange in the context of the Data for Development Challenge (Montjoye et al. 2014). The data come from the second phase of the project and consist of 15 billion call and text records for Senegal between January 1, 2013 and December 31, 2013 for all of Sonatel's user base. The data contain information on all calls and texts made or received by a SIM card, including their time, date and the location of the closest cell phone tower, which enables tracking of SIMs in space as they make calls from different tower locations. The data are anonymized, so that while random IDs are provided that make it possible to track the same SIM over time, there is no identifying information on the individuals. Due to the availability of only one year of cell phone data, the analysis is limited to 2013.
On average there are 1,657 calls or texts per ID during the year, and on average an ID has a call or text observation on 155 days. Each tower is assigned a health district based on its GPS coordinates. An individual is assigned a health district location on a given day based on the cell tower of the last call or text of the day, as has been done in other literature (Ruktanonchai et al. 2016). In instances where there are days with no calls, the missing days are assigned the health district location of the closest day on which a location is known from a call or text, as has been done in previous work (Wesolowski, Metcalf, et al. 2015, Wesolowski, Eagle, et al. 2012).[11] In this way a health district location is assigned to each SIM for every day of the year. Movement is defined as a change in location from one health district to another between two consecutive days.

The population is highly mobile, with over 80% of all Sonatel SIM cards taking at least one trip and over a quarter million traveling on average on any given day. On average annually per SIM there are 10 different trips to almost five different health districts.

Focusing on the area in the analysis, a move is still defined as a change in health district. In addition to assigning the towers within my study area to a health district, I also assign them to a health post catchment area based on their GPS location. Therefore, each traveler entering one of the five health districts is assigned a specific health post catchment area that they enter based on their last call or text of the day.[12]

[11] Appendix 1.D includes a robustness check where observations with more than 14 days in a row missing are removed.
[12] Movements within a district between health post catchment areas are not counted.

Panel a of Figure 1.2 shows the average number of people entering one of the health post catchment areas each day as a percent of the population in that catchment area, along with vertical lines marking several religious holidays and important pilgrimages. The movement patterns largely align with the holidays and pilgrimages, which supports the findings in A. S. Fall (1998) that the majority of migrants to Dakar visit their home area primarily for holidays, religious festivals and family ceremonies. On average for all the health post catchment areas, around 3 percent of the population of that area enters on any given day. Turning to Panel b of the graph, the data are broken out into individual health post catchment areas for one district. This shows that the percent of people entering can vary widely by health post catchment area and date. For health post catchment areas where an important religious leader resides, on certain religious holidays the number of people entering is close to or over 50% of the population of the area. For other health posts, the beginning of certain agricultural seasons or other holidays leads to large jumps in people entering. This variation makes it possible to study the impact of people entering on malaria cases in these areas that are otherwise geographically close together and very similar.

In 2013, Sonatel had slightly over 9.5 million unique phone numbers on its network. While coverage is very high, given the population of Senegal in 2013 was 14.2 million, not every individual is included directly with a SIM card.[13] There are two sets of people that are potentially excluded from the data and need to be accounted for: those without a phone and those with a phone but using a different mobile provider.

Individuals without a phone can be split into two groups: adults and children. In 2013 there were 92.93 mobile phone subscriptions per 100 adults in Senegal (International Telecommunication Union 2013), and based on the 2013 Demographic Health Survey (DHS), 93% of households owned at least one cell phone (ANSD and ICF International 2015).
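The daily location assignment and movement definition described in this section can be illustrated with a minimal sketch. It is not the code used for the analysis: the function names, the record layout (a list of timestamp and tower pairs for one SIM), the tower-to-district lookup, and the rule that ties between equally distant known days go to the earlier day are all assumptions made for illustration.

```python
from datetime import datetime, date, timedelta

def daily_districts(records, tower_to_district, start, end):
    """Assign one health district per day to a single SIM.

    records: list of (datetime, tower_id) pairs (hypothetical layout).
    tower_to_district: dict mapping tower_id to a health district label.
    """
    # The last call or text of each day determines that day's district.
    last_seen = {}
    for ts, tower in sorted(records):
        last_seen[ts.date()] = tower_to_district[tower]
    if not last_seen:
        return {}
    known_days = sorted(last_seen)
    assigned = {}
    day = start
    while day <= end:
        if day in last_seen:
            assigned[day] = last_seen[day]
        else:
            # Days without activity take the district of the closest known day.
            # Assumption: ties are broken in favor of the earlier day.
            nearest = min(known_days, key=lambda d: (abs((d - day).days), d))
            assigned[day] = last_seen[nearest]
        day += timedelta(days=1)
    return assigned

def count_moves(assigned):
    """Count changes of district between consecutive days."""
    days = sorted(assigned)
    return sum(1 for prev, cur in zip(days, days[1:])
               if assigned[prev] != assigned[cur])

# Toy example with made-up towers and districts.
towers = {"A": "D1", "B": "D2"}
calls = [
    (datetime(2013, 1, 1, 10, 0), "A"),
    (datetime(2013, 1, 1, 20, 0), "B"),
    (datetime(2013, 1, 5, 9, 0), "A"),
]
locations = daily_districts(calls, towers, date(2013, 1, 1), date(2013, 1, 5))
moves = count_moves(locations)
```

In the toy example, the SIM is placed in D2 on January 1 because its last call of the day uses tower B, the silent days are filled from the nearest observed day, and a single move is counted when the assigned district changes.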
In addition, based on the Listening to Senegal survey, of those people that do not own a phone, 79% state that they own a SIM card which they use in someone else's phone. Therefore, of the small percent of adults without a phone, the majority are already included in the data because they have a SIM card. This implies that most of those without a phone are children under the age of 15.

[13] While I do not have data on ownership of multiple SIM cards from the same provider, based on the Listening to Senegal survey only 9.6% of those with a phone cite using more than one mobile phone, and only 0.6% cite using more than two phones.

Based on the Listening to Senegal Survey done in 2014, Sonatel is the main provider for 80% of those surveyed with a cell phone, and 88% of those with a cell phone have a Sonatel SIM card (Agence Nationale de la Statistique et de la Démographie - Ministére de l'Economie 2014). Therefore, only around 12% of adults with a phone are excluded.

I weight the mobile phone data in order to account for the individuals not included. I use the DHS survey to compare mobility patterns between women with and without a cell phone in the household.[14] There is no statistical difference in whether a trip longer than a month was taken between those with and without a cell phone. The women with a cell phone in the household have only a slightly higher average number of trips taken in the last year. A big part of the missing group is children, though, who are likely traveling with adults that are represented in the data, but might be traveling less on average. Nevertheless, since there are no data comparing the short term movement of adults and children in Senegal, I use an across the board weight of 1.4 to represent the full population and to get an upper bound on movement. The weighted data overestimate total movement and underestimate the impact of each trip.
While the main specification uses the weighted data, unweighted results are provided in the robustness section as an upper bound on the effect size.

[14] The survey does not contain data on men.

1.3 Model of Malaria

1.3.1 Theoretical Model

To understand the impact of travel on malaria incidence, it is necessary to break up cases into those generated locally and those imported from outside of the area. In order to separate out these different factors, I build on an existing epidemiological model, which allows me to compare my coefficients to estimated parameters in the epidemiology literature. Due to the two host system, malaria is modeled using two differential equations to describe the dynamics of infected humans and infected mosquitoes. These were first modeled by R. Ross (1910) and then expanded by Macdonald et al. (1957). The model used in this chapter is a Ross-Macdonald type model based on models used in D. L. Smith and McKenzie (2004), Cosner et al. (2009) and Torres-Sorando and Rodriguez (1997).

There are two state variables and the model is usually expressed in continuous time, though here I will present it in discrete time. The state variables are $y_{it}$, the fraction of mosquitoes infected in location $i$ at time $t$, and $x_{it}$, the fraction of humans infected in location $i$ at time $t$. In the model extension here that includes the impact of population movement, cases can either be generated locally in location $i$ or imported from any other location $j$ into $i$, $I_{it}$. Local infections in location $i$ are generated based on the susceptible population and several biological parameters. The susceptible population, $S_{it-w}$, is the fraction of the population that is susceptible to malaria at time $t-w$, where $w$ is the total parasite incubation period. The other parameters are the transmission efficiencies from infected mosquitoes to humans and humans to mosquitoes, $b_{it-w}$ and $c_{it-w}$, the number of bites on humans per mosquito, $a_{it-w}$, and the ratio of mosquitoes to humans, $m_{it-w}$.
The change in the number of infected humans and mosquitoes is described by:

$$y_{it} - y_{it-w} = a_{it-w} c_{it-w} x_{it-w} \left(e^{-\mu_{it-w}\tau_{it-w}} - y_{it-w}\right) - \mu_{it-w} y_{it-w} \tag{1.1}$$

$$x_{it} - x_{it-w} = m_{it-w} a_{it-w} b_{it-w} y_{it-w} S_{it-w} + I_{it} - r_i x_{it-w} \tag{1.2}$$

where $\mu_{it}$ is the mortality rate of mosquitoes and $\tau_{it}$ is the incubation period from the time a mosquito becomes infected until it is infectious. The parameter $r_i$ is the recovery rate of humans.

I make four assumptions based on modeling the system in a low malaria setting:

1. The total parasite incubation time is one month based on the malaria cycle.
2. The mosquito population is at the steady state since mosquito populations have a relatively rapid turnover.[15]
3. All malaria cases in month $t$ are treated immediately and recover in month $t$.[16]
4. Based on D. L. Smith and McKenzie (2004), when the proportion of infected humans is small, the number of infectious bites received per day by a human (also known as the entomological inoculation rate, taking the form $EIR_{it} = \frac{m_{it} a_{it}^2 c_{it} e^{-\mu_{it}\tau_{it}} x_{it}}{\mu_{it} + a_{it} c_{it} x_{it}}$) can be approximated by $c_{it} C_{it} x_{it}$, where $C_{it}$ is the expected number of humans infected per infected human per day, assuming perfect transmission efficiency ($b_{it} = c_{it} = 1$), known as the vectorial capacity.[17]

[15] According to the CDC, adult female mosquitoes, which spread malaria, do not live more than 1-2 weeks in nature (Center for Disease Control 2015).
[16] The incidence data in the analysis are based on diagnosed cases, which are provided with free antimalarial treatment upon diagnosis. The literature shows that within a few days the majority of parasites are eliminated (Nosten and Nicholas J White 2007, N. White 1997).
[17] The assumption applies because the analysis is conducted in a low malaria setting. Appendix 1.A provides evidence that the assumption holds for this data.

Based on assumption 1, I set $w = 1$. Based on assumption 2, I solve equation 1.1 for the quasi-equilibrium proportion of infectious mosquitoes, as has been done in D. L. Smith and McKenzie (2004) and Ruktanonchai et al. (2016):

$$y_{it-1} = \frac{a_{it-1} c_{it-1} x_{it-1} e^{-\mu_{it-1}\tau_{it-1}}}{\mu_{it-1} + a_{it-1} c_{it-1} x_{it-1}} \tag{1.3}$$

Assumption 3 implies the recovery rate, $r_i$, is equal to one since all infected individuals recover within the same month. This allows me to focus on new cases of malaria in month $t$. Since immunity does not develop in low incidence areas, anyone that is not currently infected is susceptible to the disease, which implies that the susceptible population is 1 if everyone recovers. Rewriting equation 1.2 to incorporate the implications of assumptions 1-3:

$$x_{it} - x_{it-1} = \frac{a_{it-1}^2 b_{it-1} c_{it-1} m_{it-1} e^{-\mu_{it-1}\tau_{it-1}} x_{it-1}}{\mu_{it-1} + a_{it-1} c_{it-1} x_{it-1}} + I_{it} - x_{it-1} \tag{1.4}$$

Based on assumption 4, equation 1.4 can be rewritten as:

$$x_{it} = b_{it-1} EIR_{it-1} + I_{it} = b_{it-1} c_{it-1} C_{it-1} x_{it-1} + I_{it}$$

Assumption 4 is important because by rewriting $EIR_{it-1}$ as $c_{it-1} C_{it-1} x_{it-1}$, I explicitly incorporate the impact of the incidence last month, $x_{it-1}$, on the incidence in the current month using a linear functional form, which helps me approximately estimate the secondary cases generated this month by cases last month.

1.3.2 Empirical Model

Now I outline the empirical specification derived from the model developed in the previous section that I will estimate using OLS. From the model, I want to estimate specifically $b_{it-1} c_{it-1} C_{it-1}$. Due to data limitations, since I want to causally estimate the secondary cases generated by an imported case, instead of estimating a $bcC$ parameter for each location and each month, I estimate an average $bcC$ parameter across locations and time. By estimating an average parameter, I am able to include health post area fixed effects, which allow me to control for unobservable characteristics of a health post area that might impact the incidence. I can also include month fixed effects to control for the seasonality of malaria.
Therefore, using these controls allows me to identify the impact of imported cases based on deviations from the mean for a given area, controlling for seasonality.

I use the mobile phone data to calculate expected imported malaria cases entering and divide them by the population of the area they enter to calculate the imported incidence. The likelihood of an infected case entering depends on the origin of an individual, how long the person spent there and the length of time in the destination. I make two assertions:

1. The likelihood that a person, $p_t$, is infected is based on the fraction of the month spent in the district, $T_{jp}$, and the monthly incidence rate, $x_{jt}$.
2. The contribution of an imported case to a new location is calculated as a fraction of time spent in the location entered, $T_{ip}$.

By using the incidence rate in $j$, $x_{jt}$ (which includes both locally generated and imported malaria cases), rather than prevalence in $j$, I account for person $p_t$ coming in at time $t$ either being infected in the origin, $j$, or representing an imported case that came into $j$ from somewhere else first and then entered $i$. My estimating equation is then:

$$x_{it} = \beta_1 x_{it-1} + \beta_2 E(I_{it}) + \alpha Z_{it} + \gamma_i + \delta_t + \epsilon_{it} \tag{1.5}$$

$$E(I_{it}) = \frac{1}{H_{it}} \sum_{j \neq i} \sum_{p_t \in j} T_{ip} \left(x_{jt} T_{jp}\right) \tag{1.6}$$

where $\gamma_i$ are area fixed effects, $\delta_t$ are month fixed effects and $\epsilon_{it}$ represents idiosyncratic shocks. The matrix $Z_{it}$ includes zero, one and two lags of rainfall, which capture both the agricultural seasons that could influence movement and changes in malaria incidence due to environmental factors. $H_{it}$ is the population of location $i$ in month $t$. I cluster errors at the health post catchment area level to account for the fact that errors are correlated within panels.[18]

The main coefficients of interest are $\beta_1$ and $\beta_2$, which represent, respectively, the number of secondary cases generated by infected travelers and the number of primary malaria cases imported by infected travelers. Based on the above model, we have that $\beta_2 = 1$, which can be tested directly.
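To make equation 1.6 concrete, the following minimal sketch computes expected imported incidence for one destination area and month. It is an illustration under stated assumptions rather than the code used for the estimates: the function name, the dictionary layout for travelers, and the toy numbers are hypothetical, and the time fractions $T_{ip}$ and $T_{jp}$ are taken as already computed inputs between 0 and 1.

```python
import math

def expected_imported_incidence(travelers, origin_incidence, population):
    """Sketch of equation 1.6 for a single destination area i and month t.

    travelers: list of dicts with keys "origin", "t_origin" (T_jp) and
        "t_dest" (T_ip); a hypothetical layout, one entry per traveler.
    origin_incidence: dict mapping origin district j to incidence x_jt.
    population: H_it, the population of the destination in month t.
    """
    total = 0.0
    for person in travelers:
        x_jt = origin_incidence[person["origin"]]
        # Each traveler contributes T_ip * (x_jt * T_jp).
        total += person["t_dest"] * x_jt * person["t_origin"]
    # Dividing by the destination population gives imported incidence.
    # If x_jt is in cases per 1000 people, the result is per 1000 as well.
    return total / population

# Toy inputs (made up for illustration only).
incidence = {"south": 20.0, "north": 0.1}
entering = [
    {"origin": "south", "t_origin": 1.0, "t_dest": 0.5},
    {"origin": "south", "t_origin": 0.5, "t_dest": 1.0},
    {"origin": "north", "t_origin": 1.0, "t_dest": 1.0},
]
imported = expected_imported_incidence(entering, incidence, population=1000)
```

With these toy inputs the two travelers from the high incidence origin account for almost all of the imported incidence, which is the sense in which the variable combines who travels, where they come from, and how long they stay.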
Based on the model, both local and imported incidence in the previous month have the same effect, $\beta_1$, on incidence in the current month, which I will test explicitly. Additionally, based on the model, $\beta_1 = bcC$. This relationship makes it possible to calibrate whether the magnitude of the coefficient estimated is reasonable using estimates of $b$, $c$ and $C$ drawn from the literature. Based on Vercruysse, Jancloes, and Van de Velden (1983), the average daily vectorial capacity in Senegal, $C$, is 1.13. Based on Gething et al. (2011), average transmission from infected humans to mosquitoes, $c$, is 0.161. Finally, the transmission from mosquitoes to humans, $b$, is 0.05 based on the linear model in D. L. Smith, Drakeley, et al. (2010), which is closest to the model estimated in this chapter. This gives an average daily value of $0.05 \times 0.161 \times 1.13 \approx 0.0091$, or a monthly value (30 days) of 0.273 for $bcC$.

$x_{it}$ is calculated as the number of new cases from the health post data per 1000 people. Monthly population of $i$ is calculated by adding the number of people entering each month to the annual health post area population and subtracting the number of people leaving each month based on the mobile phone data. The contribution of person $p_t$, who enters $i$ from $j$ at time $t$, to imported cases is calculated based on the incidence of the district the person is coming from and the length of time spent in district $j$ and health post catchment area $i$. Only up to 15 days in the origin and up to 15 days in the destination are considered since the human incubation period is 15 days. Therefore, $T_{ip}$ is the proportion of 15 days $p_t$ spends in $i$ after entering $i$, and $T_{jp}$ is the proportion of the month, up to 15 days, that $p_t$ spent in $j$ with monthly malaria incidence $x_{jt}$ in month $t$.[19]

[18] I include a robustness check with spatial and panel autocorrelated standard errors.

Due to the complicated nature of the imported incidence variable, I break down what explains the variation in this variable.
Imported incidence combines information on travelers, the incidence where they are coming from, the timing in the destination and the origin, and the population in the destination. Based on a partial $R^2$ of 0.43, the month explains a large portion of the variation, while 0.33 of the variation is explained by the health post catchment area. Jointly, they explain about half of the variation ($R^2$ of 0.56), implying that the other half of the variation in the variable of interest is coming from a combination of the month and the location receiving imported cases. This shows how the identification comes from the unique combination of detailed data on travelers and the incidence in the origin.

1.3.3 Identification

For my identification to be correct, it is necessary that looking within a health post catchment area over time, any idiosyncratic shocks in malaria incidence are not correlated with expected imported malaria incidence. As described in the background section, agricultural seasons and holidays are the two major reasons for travel. Agricultural seasons are correlated with rainfall, and additionally, rainfall could affect the conditions for travel (quality of roads). As was shown in the graph of malaria cases and rainfall, there is also a very strong relationship between rainfall and malaria. Since I explicitly measure the relationship between rainfall and malaria incidence by including rainfall covariates in my specification, I am able to control for this potential confounder.

[19] I use the detailed knowledge of the timing from the mobile phone data to factor in how many of the 15 days were in month $t$ and how many in month $t-1$, and use the incidence in both months, $x_{jt}$ and $x_{jt-1}$, to determine the probability the person was infected.

It is also possible that holidays, which increase population movement, could also affect malaria. For example, people might spend more time outside during holidays and expose themselves to mosquitoes.
I will address this potential threat to identification using a placebo test where I scale travelers by average monthly incidence in the country rather than by the incidence of their origin. I also examine the relationship between past and future imported cases and current malaria. Finally, the dynamic panel model with a relatively short panel of twelve time periods could introduce a bias because the error term is mechanically correlated with the lagged dependent variable on the right hand side (Nickell 1981). I study this by also running a random effects specification and analyzing the differences between the random effects and fixed effects results.[20]

[20] In robustness checks included in Appendix 1.D, I also show results from estimates using an augmented version of the Arellano-Bond Generalized Method of Moments estimator designed to address situations with small T, large N panels.

1.4 Results

1.4.1 Quantifying the Effect of Imported Cases

Column 1 of Table 1.1 shows the results from the main specification. For each imported case of malaria per 1000, there are around 1.217 cases per 1000 reported. In addition, for each lagged case per 1000, there is an additional 0.350 of a case generated the following month, representing the negative externality of an imported case the previous month. This specification makes the implicit assumption that the externality from locally generated and imported cases will be the same. I explicitly test this in Column 2 by including lagged imported incidence along with lagged overall incidence. The coefficient on lagged imported incidence is not significantly different from 0, which implies that there is no differential effect between lagged imported and lagged local incidence.

I also test some of the other predictions of the model and show results of the tests at the bottom of Table 1.1. The model implies that each imported case per capita should contribute a case per capita to the incidence in the destination in the month entered.
With a p-value of 0.65 on the Wald test, the coefficient is not significantly different from 1. I also test whether the coefficient on lagged incidence is significantly different from the epidemiological parameter prediction of 0.273, and I find that it is not.[21]

I simulate a baseline scenario based on the model and compare it to the actual data. I draw values for the parameters of the model from their estimated distributions, which I use to calculate an incidence path for the year. I average these paths across 500 draws. In Panel a of Figure 3.3, I average across health post areas by district and compare to actual incidence. In a similar way, I then simulate incidence under the assumption that there is no imported incidence. Panel b of Figure 3.3 compares the results of this simulation to the baseline simulation. Imported incidence represents a substantial portion of the malaria incidence, especially for Richard Toll and Saint-Louis. On average annually per health post, travel represents 34% of the incidence.

As mentioned earlier, there could be a dynamic panel bias due to the inclusion of fixed effects with a relatively short panel.[22] To look at this, I also rerun the analysis using random effects, shown in Column 3 of Table 1.1. The coefficient on imported incidence is smaller, while the coefficient on lagged incidence is larger. In using random effects, though, I am no longer controlling for time-invariant characteristics of the health post areas that could be correlated with both imported incidence and malaria incidence. In Column 4, I include several characteristics of the health facility areas, including population density, a dummy for urban areas, and a dummy for health facility areas that are not along the border of the country.
Including these, we see the coefficient on imported incidence is bigger and closer to the coefficient from the fixed effects model, while the coefficient on lagged incidence becomes smaller. This suggests that the fixed effects might be necessary since there could be additional omitted characteristics of the health facility areas. A Hausman test comparing the two models finds they are significantly different. It is not clear which model is the correct one, though, since the fixed effects model might have some dynamic panel bias while the random effects model might have omitted variable bias. I conduct a validation exercise to calculate a better approximation of the coefficient on lagged incidence, detailed in Appendix 1.C. I find the coefficient to be approximately 0.372, which is close to the coefficient estimated using fixed effects, suggesting that the dynamic panel bias is minimal. Therefore, I use the fixed effects model for the rest of the chapter.

[21] Appendix 1.B presents additional results comparing the coefficients found in this analysis with detailed data collected in one district on imported cases and secondary cases.
[22] It is important to note, though, that the difference between T and N is not very large, with 12 time periods and 36 panels.

1.4.2 Mechanism Evidence

I now provide some additional evidence of the causal impact of imported incidence on total incidence. Figure 1.4 shows coefficients from a regression of malaria incidence on two lags and two leads of imported incidence, along with location and time fixed effects and rainfall controls.[23] Malaria incidence two months earlier and one month earlier has no relationship with imported cases in the current period. Malaria incidence in the current period and one month later are associated with imported incidence, as both of those coefficients jump. Imported incidence does not seem to have a significant impact on incidence two months later.[24] Since the sample size is much smaller after including the leads and lags, the standard errors are larger.
Nevertheless, the trend in the size of the coefficient still demonstrates that future imported incidence does not drive current malaria incidence.

23 Appendix 1.A includes a graph where leads of imported incidence are included along with lags of total incidence, rather than of imported incidence.
24 I would not expect a persistent effect of imported cases two months out because the malaria season in these areas is very short, lasting only around 3 months; therefore, there is not enough time for tertiary cases to develop due to the month-long incubation periods.

I conduct several placebo tests, shown in Table 1.2, to refute potential alternative explanations. The main specification, for which imported cases are calculated based on the incidence in the origin and the lengths of time spent in the origin and destination, is shown in Column 1. One alternative explanation is that the imported incidence variable is just picking up increased travel during certain times in certain locations, such as religious holidays and locations with important religious leaders, and those times might be correlated with higher malaria. For example, during religious holidays people might spend more time outside and be more likely to be bitten by mosquitoes. To test this, rather than using the incidence of the location a person is coming from to calculate their probability of importing malaria, I instead use the average monthly incidence across all health districts. In this way, the location that travelers enter from no longer affects the variable; only the travel patterns do. There is no relationship between this alternative variable and malaria incidence, as shown in Column 2 of Table 1.2. It is also possible that only the incidence of the origin matters.
This could be the case if, for example, travelers tend to come from locations that have similar incidence to the destination, so that the variable captures this relationship between the locations and is not related to the travel itself. To test this, I calculate expected imported incidence based on the incidence of the origins. Rather than separately calculating a probability for each person using the length of travel, I use the average number of travelers and the average time spent in the place of origin and the destination.25 Column 3 of Table 1.2 shows that when imported incidence is calculated for an average number of travelers per destination/origin pair, there is no longer a relationship between this variable and malaria incidence. These two tests demonstrate that it is the interaction of location, number of travelers and time spent that is necessary to produce the significant relationship with malaria incidence.

25 Average number of travelers was calculated based on total number of travelers and number of unique origin destination pairs in a given month.

In Columns 4 and 5, I test that the effect is only seen for malaria incidence. In Column 4, I use a variable for imported disease incidence other than malaria. This variable is based on number of consultations at a health post, removing any cases of malaria or fever and scaling by the population of the health post area. Column 4 shows that this variable has no relationship with malaria incidence in the location where diseases are being imported. In Column 5, I use the original imported malaria incidence variable as the regressor, but I change the dependent variable to be non-malarial disease incidence, and I use lagged disease incidence instead of lagged malaria incidence. Imported malaria incidence does not have a relationship with non-malarial disease incidence.
Since the regressors and dependent variables in the different placebos are on different scales, I include the beta coefficients, which are measured in terms of standard deviations and make it possible to compare coefficients across all five regressions. The beta coefficients are five to sixty times smaller for the placebo tests as compared to the base specification.

1.5 Policy Targeting Strategies

Having quantified the negative impact of travel, I now study effective allocation of resources to mitigate it. I first describe some of the existing policies implemented in Senegal towards travelers. Then, I conduct simulations to demonstrate the effectiveness of targeting policies at only some travelers and in some locations.

1.5.1 Malaria Policies Toward Travelers

MACEPA has specifically focused on reducing imported cases in the district of Richard Toll. They have used volunteers in the community to alert health workers to the arrival of new travelers. Health workers track down these travelers, ask to test them using an RDT, and treat those that test positive. Based on 2015 data provided by the Richard Toll Health District Director, 3,609 people were identified as travelers; of these, 3,386 were tested and 10 tested positive for malaria. In 2015, there were a total of 186 imported cases; therefore, this strategy was only able to detect 5% of imported cases.26

A more systematic policy to target travelers was implemented in one health post in Richard Toll, which is privately run by the Senegalese Sugar Company (CSS) for its workers and their families. CSS hires over 3000 migrant workers every year to help with the sugar harvest. Malaria was a large burden for the company, causing lower productivity, high absenteeism, and high spending on pharmaceuticals to treat it (Djibo and Ndiaye 2013).
In late 2011, the CSS implemented a new mandatory policy for all seasonal workers: testing every worker at the beginning of the season using an RDT, treating anyone testing positive, and providing workers and their families with bednets and information. As Figure 1.5 shows, there was a drastic decrease in cases after the implementation of this policy, with case numbers at zero or close to zero after the policy. Data on two types of schistosomiasis among workers at the CSS is also included in the figure and shows no drop in those diseases after late 2011, demonstrating that the drop in malaria cannot be attributed to an overall improvement in the healthcare facility.

26 Data on imported cases is based on surveys conducted in Richard Toll.

A strategy to decrease malaria cases that could be harnessed for targeting travelers is proactive community treatment (ProACT) implemented by trained home care providers (HCPs). The pilot of this intervention consisted of HCPs going door-to-door weekly to every household in a village, checking for individuals with symptoms. Compared to villages that did not receive ProACT, the odds of symptomatic malaria were 30 times lower in the intervention villages (Linn et al. 2015). This type of policy could be applied to areas that receive travelers during the weeks when the most expected infections enter.

An additional possible policy is to utilize mobile phones to target travelers with information. Mobile phones are always linked to a tower if they are turned on; therefore, the provider knows as soon as an individual changes towers. A targeted text message can then be sent as soon as someone that has opted into the program changes location from a high malaria tower to a low malaria tower, recommending and providing an incentive to get tested at the closest clinic for free.
Text messaging has been used by the Ministry of Health in Senegal for providing information on diabetes, and other studies have found that SMS technology can be effective in changing behavior (Senegal Ministry of Health 2016, Aker, Ksoll, et al. 2015, Ksoll et al. 2014, Fischer et al. 2016).

1.5.2 Targeting Simulations

The policies in the previous section serve as examples, but I focus the simulations on whom a policy should be directed towards, rather than on the particular policy. If there were no targeting, 6,956,197 travelers would need to be targeted and 602 malaria cases, representing around 40% of total cases in this area, would be treated or prevented in the analysis area. Even if the average cost per traveler were only $0.50 (around the cost of an RDT), this would amount to close to $6,000 per case. Compared to a typical benchmark for cost-effectiveness of $150, targeting all travelers would be 60 times as costly. Therefore, if a policy were to target travelers, only a subset should be targeted.

I base the targeting strategies on the district the individuals are coming from and the month in which they are traveling. Therefore, if a district-month is chosen as receiving targeting, then every person traveling from that district to the pre-elimination area in that month would be treated with the policy. I assume that, whatever policy is chosen, there is a uniform cost per person that is the same for all strategies and the policy works with 100% effectiveness across all strategies. The cost for targeting is based on a cost per traveler targeted that I assign as $0.50 per traveler and a fixed cost based on a cost of $309 per health facility, amounting to a total of $47,586 for the area.27

I first consider four strategies for choosing district-months to target, assuming everyone from a chosen district-month entering any of the five low malaria districts would be targeted. In the first, the government randomly chooses district-months to target.
For the second, the government chooses district-months randomly within the malarial season (August to December) before randomly choosing district-months during the rest of the year. The third strategy considered is one where the government uses monthly malaria incidence in the prior year to order the district-months from those with the highest to the lowest incidence. The fourth strategy calculates a cost-benefit value for each district-month based on the number of travelers and the fixed cost and uses incidence from the previous year to calculate the total impact of the imported cases (the benefit). The top panel of Figure 1.6 shows the full cost curves, from a strategy where only one district-month is treated, all the way up to all district-months being treated. These were calculated using simulations based on the earlier model to calculate the total primary and secondary cases attributed to travelers. Depending on the resources available to the government, it is possible to determine for any given budget how many cases would be treated or averted depending on the targeting strategy chosen.

27 The fixed cost was calculated based on the Fiscal Year 2015 Malaria Operational Plan in which $500,000 was budgeted for training, supervision and monitoring of staff at 1,620 health huts in Senegal. Assuming a plan to target travelers would be implemented at the community level and require similar resources, I assign the cost proportionally to areas.

Zooming in on the curve below $200,000, panel b of Figure 1.6 shows the strategies of targeting randomly or randomly during the malarial season are extremely inefficient. Randomly targeting districts during the malaria season is the closest strategy to the one attempted in Richard Toll using volunteers to alert health workers of new travelers. In this simulation, I find that with this random targeting during the malarial season, if around 10,000 travelers are targeted, we would expect around 3 cases of malaria averted.
If instead, 10,000 travelers were targeted based on the cost-effective strategy, around 77 cases of malaria would be treated or averted.

Until now, the assumption has been that while individuals coming from certain locations are targeted, the targeting itself would occur throughout all five districts uniformly. Under this scenario, if the government spends up to $100,000, on average the cost-effective strategy for choosing district-months does around 25 percent better than the strategy of using just information on incidence from the year before. Yet, the cell phone data analysis can provide additional information for targeting because it is not necessary to train all health workers in the five districts since there are particular hotspots that receive the majority of high-risk travelers. Therefore, I consider scenarios where travelers coming only into particular health post areas are targeted, which decreases the fixed cost since only the workers in those areas need to be trained. Panel c of Figure 1.6 demonstrates the cost curves if only 6, 12, 18 or 24 health facility areas were targeted and compares this to strategies 3 and 4 of using incidence from the previous year or targeting in all 36 health post areas.28 At a relatively low cost, it now becomes possible to decrease incidence if resources are focused on particular health facility areas and travelers coming from specific districts in certain months. Taking the most cost-effective strategy going up to $100,000 of targeting within seven health facility areas, on average it does 2.2 times as well as the strategy of only utilizing information on incidence from the year before. On average it does 77% better than the targeted policy that is implemented throughout all five districts.

28 The health facility areas included were chosen based on cost effectiveness.

1.6 Robustness Checks

I do several robustness checks in Table 1.3 to test the main specification, which is displayed in Column 1.
Column 2 uses an estimate of imported incidence without weighting the mobility data to be representative of the full population. This assumes that the only movement in the country is the movement I see in the data, which would be an underestimate. As discussed, weighting the data represents an upper bound for the level of movement. The actual movement that occurs, and therefore imported incidence, is then somewhere in between these lower and upper bounds. Thus the real coefficient should also lie between these upper and lower bounds of 1.2 and 1.7.

In addition to weighting the movement, Column 3 scales both imported incidence and total incidence by health post utilization at the region level. The results are similar and not significantly different from the results in Column 1. As discussed, utilization is relatively uniform throughout the country; therefore, incorporating utilization does not significantly change the results.

In Column 4, I rerun the specification using Conley standard errors that account for spatial autocorrelation and serial autocorrelation over time (Hsiang 2010).29 The results remain significant.

29 I use a 30km cutoff for the spatial correlation and two lags for the autocorrelation. The standard error on current imported incidence becomes smaller because when correcting for the spatial and autocorrelation it is not possible to account for panel clustering.

In Column 5, I use net imported incidence, subtracting expected infected travelers leaving the health post catchment area. This helps to remove some of the noise from individuals that might be living on the border between two districts and traveling back and forth constantly, since on net, their travel would be close to 0. The coefficients on imported incidence and lagged incidence do not change significantly and they remain significant.

In Column 6, I more explicitly account for individuals that might be living on the border between two districts and traveling between the two. Since malaria is not determined by borders but by geographic location, the prevalence close to the border on either side should not be very different, and therefore, travelers on the border that go back and forth within a small radius are not expected to impact the incidence. Column 6 shows that the coefficients remain significant but increase and the standard error on imported incidence becomes larger. This could indicate that neighboring travelers from further away are important for malaria transmission, and omitting them leads to an upward bias in the coefficient.30

1.7 Conclusion

The chapter set forth to quantify the link between short term population movement and disease incidence in the context of malaria in Senegal in order to study how policies could be targeted effectively to mitigate this negative externality. This type of study has not been possible before due to either a lack of comprehensive data on short term movement at the scale of a country, or else a dearth of detailed data on disease incidence. New big data collected by companies such as cell phone providers is now making the measurement of short term movement possible. In addition, due to the initiative taken by the Ministry of Health to work on eliminating malaria completely from the north of the country, surveillance data has been collected on a monthly level for each health post in the North, which makes it possible to study how short term movement into these areas can increase incidence of malaria. The study finds that for each imported case of malaria, there are around 1.6 cases of malaria reported at healthpost areas in the North. Using these findings, I calculate that implementing a policy in certain areas and targeting the most cost-effective districts during certain months results in the largest drop in cases, with up to 2.2 times as many cases treated or averted as compared to the next best strategy.
This is the first study to evaluate the cost effectiveness of different targeting schemes directed at travelers.

While the work here focuses on malaria, it is possible to implement these types of models for other infectious diseases such as influenza, Ebola, or Zika. As cell phone usage has become extremely prevalent throughout the developing world and cell phone providers are beginning to understand how the data they collect can be used by policy makers to implement better policies, measuring short term movement becomes easier. What then becomes necessary is the collection of high frequency data on infectious diseases. This data will make it possible to study these models that help policy makers better target interventions, and in turn will make it easier to evaluate the interventions if case data are already being collected. This could help countries lower the disease burden from existing infectious diseases and prevent epidemics of new diseases, leading to beneficial economic and social consequences.

30 Additional robustness checks are available in Appendix 1.D.

Figure 1.1: Average Monthly Healthpost Catchment Area Malaria Incidence per 1000 and Average Monthly Rainfall, Jan 2013-Dec 2015 by Health District

Figure 1.2: People Entering a Health Post Catchment Area as a Percent of the Population in the Area
(a) Avg Across All Health Post Catchment Areas
(b) Healthpost Catchment Areas in Richard Toll
Notes: Red dashed lines in panel (a) represent some important religious holidays and pilgrimages. The scale in panel (b) is bounded at 15%, but for Diama Savoigne and Dabi Tiguette Djoudj, the value in January goes up almost to 50%.

Figure 1.3: Predicted with and without Imported Cases and Actual Incidence Averaged Across Health Post Areas by District
(a) Goodness of Fit
(b) Effect of Travelers
Notes: The orange dashed lines represent the monthly predicted malaria incidence averaged across health post areas within a district.
This was calculated based on values for the parameters of the model drawn from their distributions. I conducted 500 replications and used the mean monthly incidence value per health post area. Panel a compares the predicted values to the actual malaria incidence, where the solid green line is actual incidence averaged across health posts within a district. In panel b, the predicted incidence is compared to a scenario where no cases were imported by travelers, shown in dashed blue lines. Incidence with 0 imported cases was calculated using the same 500 replications for parameter values, but imported cases were set to 0.

Figure 1.4: Estimated Impact of Future, Current and Past Expected Imported Malaria Incidence
Notes: The figure was constructed based on a regression of current malaria incidence on imported incidence of malaria two months later, one month later, currently, last month and two months ago, controlling for time and location fixed effects and rainfall covariates and clustering errors at the health post area level.

Figure 1.5: Effect on Malaria Cases of a Policy Targeting Migrant Workers at the Senegalese Sugar Company
Notes: The figure shows number of cases of malaria and two types of schistosomiasis seen at the health post of the Senegalese Sugar Company. The red vertical line marks the timing of when a new policy was implemented by the company that tested every migrant worker for malaria and treated those that tested positive. Data was provided by the CSS.

Figure 1.6: Cost and Benefit Under Different Strategies
(a) Full Cost Curve
(b) Cost Curve Zoomed in on Less than $200,000
(c) Targeting Particular Health Facility Areas, Cost Curve Zoomed in on Less than $200,000
Notes: Panels (a) and (b) show four different strategies for targeting travelers. Each symbol represents a district-month. Targeting a specific district-month means targeting all travelers entering the five low malaria districts from that district in that month.
The strategies lay out which district-months are targeted first. In addition to considering district-months to target, panel (c) also shows strategies for targeting only particular health facility areas in the five low malaria districts. The cost is calculated based on a variable cost of $0.50 per traveler and a fixed cost of $47,586 split proportionally between health facility areas. The benefit is based on simulations using 500 draws of the parameters of the model to calculate the number of primary and secondary cases generated by travelers from each district in each month.

Table 1.1: Effect of Imported Malaria Incidence

                            (1)           (2)           (3)           (4)
                            Fixed         Fixed         Random        Random
                            Effects       Effects       Effects       Effects
Imported Incidence          1.217**       1.150**       0.789*        0.862**
                            (0.471)       (0.516)       (0.419)       (0.402)
Lag Imported Incidence                    0.117
                                          (0.246)
Lag Incidence               0.350***      0.346***      0.463***      0.444***
                            (0.0535)      (0.0553)      (0.0635)      (0.0605)
Rain in cm                  0.00711       0.00668       0.00525       0.00626
                            (0.00976)     (0.0101)      (0.00948)     (0.00889)
Lag Rain in cm              0.0266        0.0271        0.0268        0.0277
                            (0.0185)      (0.0183)      (0.0172)      (0.0171)
Lag 2 Rain in cm            0.0372**      0.0377**      0.0371***     0.0379***
                            (0.0152)      (0.0151)      (0.0132)      (0.0135)
Constant                    -0.0360       -0.0372       -0.0396*      -0.0548**
                            (0.0312)      (0.0309)      (0.0226)      (0.0257)
Month FE                    Yes           Yes           Yes           Yes
Health Post Area FE         Yes           Yes           No            No
Health Post Area Controls   No            No            No            Yes
Health Post x Month Obs     396           396           396           396
R-squared                   0.513         0.513         0.504         0.507

Hausman Test Comparing Column 1 and Column 4: p-value=0.003
Test Imported=1: p-value=0.647
Test Lag Incidence=0.273: p-value=0.157
Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1
Notes: Observations clustered at the health post catchment area level. Column 4 includes controls for health post area population density, a dummy for urban health post areas and a dummy for non-border health post areas. Based on the predictions of the epidemiological model, the coefficient on imported incidence should be 1 and the coefficient on lag incidence should be 0.273.
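The two coefficient tests reported at the bottom of Table 1.1 can be roughly checked from the Column 1 estimates and their standard errors. The sketch below is only an illustration: it assumes a standard normal reference distribution, whereas the reported p-values presumably come from the clustered Wald test with a small-sample distribution, so the numbers should be close to, but not identical with, 0.647 and 0.157. The function name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(estimate, std_error, null_value):
    """Two-sided p-value for H0: coefficient == null_value.

    Assumption: normal approximation, not the exact small-sample
    distribution used for the tests reported in Table 1.1.
    """
    z = (estimate - null_value) / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Column 1 of Table 1.1: coefficient, standard error, and model prediction.
p_imported = two_sided_p(1.217, 0.471, 1.0)    # H0: imported incidence = 1
p_lag = two_sided_p(0.350, 0.0535, 0.273)      # H0: lag incidence = 0.273

print(f"Imported = 1:          p ~ {p_imported:.3f}")
print(f"Lag incidence = 0.273: p ~ {p_lag:.3f}")
```

Under this approximation neither hypothesis is rejected at conventional levels, which agrees with the conclusion drawn from the reported tests.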
Table 1.2: Placebo Tests

Columns: (1) Baseline Model; (2) Travel Scaled by Avg Monthly Incid; (3) Avg Travel Scaled by Monthly Incid; (4) Travel Scaled by Non-Malarial Disease Incid; (5) Effect on Non-Malarial Disease

                            (1)         (2)         (3)         (4)         (5)
Imported Incidence          1.217**     0.0134      0.000647    0.000929    16.26
                            (0.471)     (0.0632)    (0.0708)    (0.00870)   (20.37)
Imported Beta Coefficient   0.237       0.0142      0.000671    0.0141      0.0483
Lag Incidence               0.350***    0.364***    0.364***    0.364***    0.245**
                            (0.0535)    (0.0647)    (0.0648)    (0.0645)    (0.0961)
Rain in cm                  0.00711     0.00297     0.00293     0.00296     0.163
                            (0.00976)   (0.00909)   (0.00906)   (0.00905)   (0.728)
Lag Rain in cm              0.0266      0.0276      0.0275      0.0274      0.191
                            (0.0185)    (0.0178)    (0.0174)    (0.0177)    (0.389)
Lag 2 Rain in cm            0.0372**    0.0441***   0.0440***   0.0440***   -1.150**
                            (0.0152)    (0.0145)    (0.0146)    (0.0144)    (0.460)
Constant                    -0.0360     -0.0340     -0.0335     -0.0353     24.79***
                            (0.0312)    (0.0311)    (0.0310)    (0.0366)    (2.494)
Month FE                    Yes         Yes         Yes         Yes         Yes
Health Post Area FE         Yes         Yes         Yes         Yes         Yes
Health Post x Month Obs     396         396         396         396         396
R-squared                   0.513       0.490       0.490       0.490       0.306

Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1
Notes: Observations clustered at the health post catchment area level. The beta coefficients measure the effect in terms of standard deviations, making it possible to compare coefficient magnitudes across the five regressions. Columns 2-4 show regressions where imported incidence has been constructed in an alternative way that does not incorporate both incidence in the origin and the timing of the particular traveler. In Column 5, the dependent variable is non-malarial disease incidence while imported malaria incidence is calculated as is done in the baseline specification.

Table 1.3: Robustness Checks

Columns: (1) Baseline Model; (2) Unweighted; (3) Scaled by Utilization; (4) Spatial and Autocorrelation SE; (5) Net Imported; (6) No Neighbors

                          (1)         (2)         (3)         (4)         (5)         (6)
Imported Incidence        1.217**     1.739**     1.378**     1.217***    1.369***    1.545*
                          (0.471)     (0.673)     (0.612)     (0.298)     (0.472)     (0.828)
Lag Incidence             0.350***    0.350***    0.356***    0.350***    0.351***    0.367***
                          (0.0535)    (0.0535)    (0.0534)    (0.0674)    (0.0532)    (0.0533)
Rain in cm                0.00711     0.00711     0.0181      0.00711     0.00682     0.00631
                          (0.00976)   (0.00976)   (0.0253)    (0.00852)   (0.00972)   (0.0104)
Lag Rain in cm            0.0266      0.0266      0.0694      0.0266      0.0254      0.0272
                          (0.0185)    (0.0185)    (0.0474)    (0.0164)    (0.0184)    (0.0187)
Lag 2 Rain in cm          0.0372**    0.0372**    0.0955**    0.0372**    0.0370**    0.0340**
                          (0.0152)    (0.0152)    (0.0388)    (0.0156)    (0.0152)    (0.0149)
Constant                  -0.0360     -0.0360     -0.0939                 -0.0333     -0.0348
                          (0.0312)    (0.0312)    (0.0800)                (0.0310)    (0.0312)
Health Post x Month Obs   396         396         396         396         396         396
R-squared                 0.513       0.513       0.511       0.665       0.514       0.510

Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1
Notes: Month and health post area fixed effects used in all specifications. Standard errors clustered at the health post catchment area level for all regressions except the results shown in Column 4. In Column 4, I use Conley standard errors, which account for spatial autocorrelation and serial autocorrelation over time, but do not cluster at the health post catchment area level. In Column 2, I do not weight movement observations to scale up to the full population of Senegal. In Column 3, I scale the expected imported incidence and total incidence by the utilization of the region based on DHS data. In Column 5, I calculated net imported incidence by subtracting out the expected cases leaving a health facility area divided by the population of the area. Column 6 only includes movement from districts that do not share a border.

1.A Additional Tables and Figures

Figure 1.A.1: Annual Malaria Incidence in Senegal in 2013
Notes: Data come from tested and confirmed cases at the health district level compiled by the National Malaria Control Program.

Figure 1.A.2: Senegal Health Districts and Location of Health Posts in the North Used in the Analysis
Notes: The five very low malaria health districts used in the analysis are subdivided into health post catchment areas that group health posts and mobile phone towers together.

Figure 1.A.3: Estimated Impact of Future Expected Imported Malaria Incidence and Past Total Incidence
Notes: The figure was constructed based on a regression of current malaria incidence on imported incidence of malaria two months later, one month later and currently as well as total incidence last month and two months ago, controlling for time and location fixed effects and rainfall covariates and clustering errors at the health post area level.

Figure 1.A.4: Testing the Approximation of the Entomological Inoculation Rate in Assumption 4
Notes: For each health post area in each month, the entomological inoculation rate (EIR) is calculated, as is the vectorial capacity times transmission to mosquitoes times incidence (cCx). Based on Assumption 4, in low malaria districts it should be possible to approximate EIR with cCx. In the scatter plot, this means the points should be along the 45 degree line, which is shown in red. The graph shows that for the districts analyzed here, the assumption holds. The EIR and cCx were estimated using values from the literature for the various biological malaria parameters. Based on Gething et al. (2011), average transmission from infected humans to mosquitoes, c, is 0.161.
The incubation period, τ, is 9 days (Killeen, A. Ross, and T. Smith 2006) and the bites on humans per mosquito, a, is 0.3 (Ruktanonchai et al. 2016). Average mosquito lifespan of 12.5 is used.

1.B Primary and Secondary Case Detection in Richard Toll

In the health district of Richard Toll, MACEPA has been implementing what is known as reactive case investigation since 2012. When a patient tests positive for malaria, an investigation begins whereby within seven days detailed information, including travel history, is collected on the individual as well as every person living in the same household, and all household members are tested for malaria as well. In addition, malaria treatment is given to all household members. Finally, all individuals living in households within 100m of the initial positive case are tested for malaria, and those in a household where a positive case is found are treated for the disease. Data on all the primary and secondary cases between 2013 and 2015 are available, and provide some measure of the spread of malaria from both imported and local cases. Primary cases were only investigated when the infected individual stayed in a household within Richard Toll.

This type of data collection is very costly in both time and resources because it requires sending out health workers to individual households and tracking down the people that live there and in the proximity. The analysis in this one district can help validate how well the mobile phone data analysis does. Nevertheless, doing this type of survey on a large scale would be much more costly as compared to using the mobile phone data over a larger geographic area.

Table 1.B.1 shows the case numbers for each year between 2013 and 2015, split between imported and local as well as the number of secondary cases found through the investigations of the primary cases.
The table shows that for each primary imported case there are on average between 0.09 and 0.15 secondary cases and for each local case there are on average between 0.12 and 0.23 secondary cases. These values are lower than what the main specification shows for secondary cases generated in the following month. The lower values for secondary transmission can be explained by the fact that the reactive case investigation provides medication to everyone in the household, regardless of the outcome of the test. Therefore, someone that might have recently been infected and does not yet have enough of the parasite to test positive using an RDT would not be counted as a secondary case, even if he or she might have been one without the preemptive treatment. It is also possible that this particular district has fewer mosquitoes as compared to the other four districts, making spread to secondary cases lower. This would make sense given the more intense policy implementation in this district, which acts as a pilot area for most malaria programs before they are rolled out to the rest of the country. Given these caveats, our coefficient of 0.35 on average for all five districts is in line with the results in this particular district.

Table 1.B.1: Primary and Secondary Cases in Richard Toll District

                                (1)         (2)          (3)
                                Primary     Secondary    # of Secondary Cases
                                Cases       Cases        per Primary Case
2013   Imported                 121         18           0.15
       Local                    87          20           0.23
       Total Investigated       208         38
2014   Imported                 85          8            0.09
       Local                    33          6            0.18
       Total Investigated       118         14
2015   Imported                 178         20           0.11
       Local                    68          8            0.12
       Total Investigated       246         28

Notes: Travel history data were collected by MACEPA on all positive cases of malaria and individuals within the household or geographically near households were tested to pick up secondary cases.

1.C Validation Exercise

Since both the fixed effects and random effects models could have potential bias, I conduct an exercise to get a better approximation for the coefficient.
The basis of the exercise comes from having equations for my dependent variable at times 1, 2 and 3, each of which is based on the lag of the variable. This is shown in equations 1.7 to 1.9.

$$x_{i1} = \beta_1 x_{i0} + \gamma_i + \epsilon_{i1} \tag{1.7}$$
$$x_{i2} = \beta_1 x_{i1} + \gamma_i + \epsilon_{i2} \tag{1.8}$$
$$x_{i3} = \beta_1 x_{i2} + \gamma_i + \epsilon_{i3} \tag{1.9}$$

Subtracting equation 1.7 from equations 1.8 and 1.9:

$$x_{i2} - x_{i1} = \beta_1 (x_{i1} - x_{i0}) + (\epsilon_{i2} - \epsilon_{i1}) \tag{1.10}$$
$$x_{i3} - x_{i1} = \beta_1 (x_{i2} - x_{i0}) + (\epsilon_{i3} - \epsilon_{i1}) \tag{1.11}$$

I can then estimate $\hat{\beta}_1$ for each equation, and I will call the $\hat{\beta}_1$ estimates from equations 1.10 and 1.11 $C_1$ and $C_2$ respectively. These should have the following forms:

$$C_1 = \beta_1 + \frac{\sigma\big((x_{i1} - x_{i0}), (\epsilon_{i2} - \epsilon_{i1})\big)}{\sigma^2(x_{i1} - x_{i0})} = \beta_1 - \frac{\sigma_\epsilon^2}{\sigma^2(x_{i1} - x_{i0})} \tag{1.12}$$
$$C_2 = \beta_1 + \frac{\sigma\big((x_{i2} - x_{i0}), (\epsilon_{i3} - \epsilon_{i1})\big)}{\sigma^2(x_{i2} - x_{i0})} = \beta_1 - \frac{\beta_1 \sigma_\epsilon^2}{\sigma^2(x_{i2} - x_{i0})} \tag{1.13}$$

I can calculate the values for $C_1$ and $C_2$ by directly running regressions 1.10 and 1.11 with my data. My actual specification includes additional variables (imported incidence, rainfall and its lags, and month fixed effects); therefore, to obtain the simplified equations, I regress $x_{i2} - x_{i1}$ and $x_{i1} - x_{i0}$, along with the analogous values for equation 1.11, on the additional covariates and use the residuals for the simplified regressions. These residualized regressions allow me to obtain estimates for $C_1$ and $C_2$. I then can also obtain the variances of the residualized $x_{i1} - x_{i0}$ and $x_{i2} - x_{i0}$, which then leaves two equations with two unknowns, $\sigma_\epsilon^2$ and $\beta_1$. Solving for $\sigma_\epsilon^2$ in equation 1.12:

$$\sigma_\epsilon^2 = \sigma^2(x_{i1} - x_{i0})\,\beta_1 - \sigma^2(x_{i1} - x_{i0})\,C_1 \tag{1.14}$$

Substituting equation 1.14 into equation 1.13:

$$\sigma^2(x_{i1} - x_{i0})\,\beta_1^2 - \big(\sigma^2(x_{i2} - x_{i0}) + \sigma^2(x_{i1} - x_{i0})\,C_1\big)\beta_1 + \sigma^2(x_{i2} - x_{i0})\,C_2 = 0 \tag{1.15}$$

Estimating equations 1.10 and 1.11 and the two variances I get:

$$C_1 = -0.174 \tag{1.16}$$
$$C_2 = 0.288 \tag{1.17}$$
$$\sigma^2(x_{i1} - x_{i0}) = 5.254 \tag{1.18}$$
$$\sigma^2(x_{i2} - x_{i0}) = 12.66 \tag{1.19}$$

Solving for $\beta_1$, I get $\beta_1 = 0.372$ or $\beta_1 = 1.863$. This implies $\sigma_\epsilon^2 = 2.87$ or $\sigma_\epsilon^2 = 10.7$. We know the variance of the residual from estimating equation 1.10 is $\sigma^2(\epsilon_3 - \epsilon_1) = \sigma^2(\epsilon_3) + \sigma^2(\epsilon_1) \approx 2\sigma_\epsilon^2$.
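As a numerical check of the algebra, the quadratic in equation 1.15 can be solved directly from the four estimates in equations 1.16 to 1.19. The short sketch below is only an illustration; the variable names `V1` and `V2` (for the two residualized variances) are hypothetical labels, and the implied error variances use equation 1.14.

```python
import math

# Estimates reported in equations 1.16 to 1.19.
C1, C2 = -0.174, 0.288
V1 = 5.254   # variance of residualized (x_i1 - x_i0)
V2 = 12.66   # variance of residualized (x_i2 - x_i0)

# Equation 1.15: V1 * b^2 - (V2 + V1 * C1) * b + V2 * C2 = 0
a = V1
b = -(V2 + V1 * C1)
c = V2 * C2
disc = b * b - 4.0 * a * c

roots = sorted([(-b - math.sqrt(disc)) / (2.0 * a),
                (-b + math.sqrt(disc)) / (2.0 * a)])

# Equation 1.14: implied error variance for each candidate beta_1.
sigma2 = [V1 * (beta - C1) for beta in roots]

for beta, s2 in zip(roots, sigma2):
    print(f"beta_1 = {beta:.3f}, implied error variance = {s2:.2f}")
```

The two roots and their implied error variances reproduce the pairs stated in the text, so choosing between them only requires comparing each implied variance with an independent estimate of the error variance.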
From the data, I calculate the variance of the residual to be 6.32, implying $\sigma_\epsilon^2 = 3.16$. Since the variance of 2.87, which I get when $\beta_1 = 0.372$, is closest to this, it must be that $\beta_1 = 0.372$.

1.D Additional Robustness Checks

I do several additional robustness checks to supplement those provided in the main text. Column 1 of Table 1.D.1 shows the baseline specification as comparison. In Column 2, I show results from rerunning the analysis using cases rather than incidence. The model remains the same, with the only change being that both sides are multiplied by the population in location i. Therefore, the coefficients should remain around the same size since the hypotheses for $\beta_1$ and $\beta_2$ are the same. Indeed, the coefficients are very similar to the main specification using incidence. The coefficient on imported cases is still not significantly different from 1, although it is now significant at the 0.01 level because the standard error has decreased and the coefficient is larger. The coefficient on lagged cases is almost identical to the main specification. In Column 3, I show results from using an augmented version of the Arellano-Bond Generalized Method of Moments estimator designed to address situations with a small number of time periods and a large number of panels. As discussed extensively in the chapter, there does not seem to be evidence for a large dynamic panel bias, which could be due to the fact that the number of panels is not much larger than the number of time periods. Nevertheless, I include these results, which show that the coefficient on imported incidence decreases, but is still not significantly different from 1, as is predicted by the model. The coefficient on lagged incidence increases significantly, but this is standard when using an instrumental variable, as is done with the Arellano-Bond GMM estimator.
In Column 4, I loosen one of the implications of assumption 3, that the susceptible population is equal to 1, and instead I calculate the susceptible population and scale lagged incidence by this susceptible population to account for the fact that those that are already infected will not become a secondary case in the following month. This scaling does very little to the regression coefficients, which is expected since on average the susceptible population is 0.997. Finally, in Column 5, I show results where I restrict the movements in the mobile phone data that I include in calculating expected imported cases. I remove any movement where the SIM card had no calls or texts for more than two weeks before or two weeks after. Without information for each day, I do not know the exact time when an individual moved between districts. If only a few days are missing, this won't necessarily affect the analysis, but with over two weeks missing, the person might have entered much earlier and therefore would have shown up as a malaria case much earlier, and I would be attributing them to the wrong time period. Nevertheless, excluding these moves does not have a large effect, except that it increases the coefficient on imported incidence. This is to be expected since by removing the trips surrounded by missing data, I now am accounting for fewer trips in the data and therefore attributing more impact per trip seen.
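The restriction used in Column 5 can be illustrated with a short sketch. It assumes one possible reading of the rule: a movement counted on a given day is kept only if the SIM has at least one call or text in the 14 days before that day and at least one on that day or in the 14 days after. The data layout and the names `has_activity_near` and `filter_moves` are hypothetical and are not taken from the dissertation.

```python
from bisect import bisect_left, bisect_right


def has_activity_near(active_days, move_day, window=14):
    """True if the SIM is observed within `window` days before and after the move.

    active_days: sorted list of day indices with at least one call or text.
    """
    # Any observed day in [move_day - window, move_day - 1]?
    before = bisect_left(active_days, move_day) > bisect_left(active_days, move_day - window)
    # Any observed day in [move_day, move_day + window]?
    after = bisect_right(active_days, move_day + window) > bisect_left(active_days, move_day)
    return before and after


def filter_moves(moves, activity, window=14):
    """Keep only moves surrounded by observed activity.

    moves: list of (sim_id, move_day, origin, destination).
    activity: dict mapping sim_id to its sorted list of active days.
    """
    return [m for m in moves if has_activity_near(activity[m[0]], m[1], window)]


# Example: the move on day 40 follows a gap of more than two weeks and is dropped.
activity = {"sim_a": [1, 5, 40, 41]}
moves = [("sim_a", 5, "district_1", "district_2"),
         ("sim_a", 40, "district_2", "district_1")]
kept = filter_moves(moves, activity)
```

A stricter or looser window can be tried by changing `window`, which is one way to gauge how sensitive the imported incidence measure is to gaps in the phone records.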
Table 1.D.1: Robustness Checks

                      (1)          (2)          (3)             (4)               (5)
                      Baseline     Cases        Arellano-Bond   Scale by          Restrict
                      Model                                     Susceptible Pop   Phone Data

Imported Incidence    1.217**      1.349***     1.070***        1.212**           1.523**
                      (0.471)      (0.211)      (0.376)         (0.493)           (0.745)
Lag Incidence         0.350***     0.352***     0.619***        0.351***          0.353***
                      (0.0535)     (0.0762)     (0.0716)        (0.0562)          (0.0544)
Rain in cm            0.00711      -0.128       0.000636        0.00653           0.00686
                      (0.00976)    (0.175)      (0.00822)       (0.00998)         (0.00989)
Lag Rain in cm        0.0266       0.0681       0.0243          0.0255            0.0255
                      (0.0185)     (0.659)      (0.0178)        (0.0188)          (0.0184)
Lag 2 Rain in cm      0.0372**     0.709        0.0322**        0.0364**          0.0374**
                      (0.0152)     (0.583)      (0.0148)        (0.0156)          (0.0148)

Remaining rows, with values listed in the order they appear in the source (column alignment not recoverable):
Constant: -0.415 (0.264); -0.0360 (0.0312); -0.0352 (0.0311)
Health Post x Month Obs: 396; 396; 432; 396; 396
R-squared: 0.664; 0.513; 0.761; 0.510

Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Notes: Month and health post area fixed effects used in all specifications. Standard errors clustered at the health post catchment area level for all regressions. In Column 2, I use cases rather than cases per 1000 people. Column 3 shows results from an augmented version of the Arellano-Bond Generalized Method of Moments estimator designed to address situations with a small number of time periods and a large number of panels. In Column 4, lag incidence is scaled by the percent of population susceptible to malaria. In Column 5, movements are restricted to those where the SIM card had a call or a text within 14 days of when the movement is counted.

Chapter 2

Predicting Dynamic Patterns of Short Term Movement and Implications for Studying Malaria Transmission

2.1 Introduction

Short term human mobility has important health consequences as it has been linked to the propagation of diseases such as influenza, malaria, and HIV/AIDS (Oster 2012, Adda 2016, Balcan et al. 2009, Wesolowski, Eagle, et al. 2012).
Yet measuring short term movement, especially in low income countries, has been difficult, requiring expensive surveys that cannot be implemented over a wide geographic area. More recently, mobile phone data has made it possible to study short term movement for a whole country at a higher spatial and temporal granularity. But the proprietary nature of the data and privacy concerns can prohibit use by researchers or policy makers. This chapter is one of the first to study how the temporal dynamics of short term movement measured by cell phone data can be predicted.

To study short term movement, I focus on two main drivers: economic and social. Two types of economic factors are individuals traveling for seasonal work or traveling to a big market or commercial center. People also take short trips to visit family or friends during religious holidays and vacations. In past literature, long term migration networks measured through census data were used to study short term movement (Wesolowski, Buckee, et al. 2013, Wesolowski, Stresman, et al. 2014, Ruktanonchai et al. 2016, Tizzoni et al. 2014). Yet, while these can capture some of the linkages between locations that would drive short term movements such as social visits, in order to be able to study the temporal changes in short term movement over the course of a year, it is necessary to capture the short term shocks that lead to increased movement during certain points in time. In this study, measures of long term networks are combined with short term shocks and drivers in order to estimate not only the average relative size of short term movement between locations, but the temporal changes in short term movement, which has not been done in the literature before.

Putting together data that is available through free, online sources for the majority of low income countries, I use Senegal as a case study to predict short term movement within the country.
Mobile phone records for over 9.5 million subscribers over the course of a year are used to estimate short term movement patterns and predict these patterns on a monthly basis based on a combination of long term networks and short term drivers. The predictions are then used to study the relationship between malaria and population movement to exemplify the possibility of using such predictions in disease related policy making. Considering the seasonality of many diseases such as malaria or influenza, it is necessary for policy makers to understand the temporal changes in population movement which can be an important factor in spread of disease. In many countries though, it is not possible to measure short term movement directly. This chapter demonstrates how data that is widely available for the majority of countries can be used to measure long term networks and shocks that can predict short term movement and allow policy makers to target the negative impact of population movement on the propagation of diseases.

2.2 Data

I compile a unique dataset that brings together variables from a number of sources in order to capture the drivers of migration. These variables are meant to measure the long term population networks, as well as short term economic and social drivers. The data are aggregated for the 76 health districts of Senegal, and the time scale is monthly. The dataset created contains an observation for every district pair for every month in 2013.

The data used to measure short term movement come from phone records made available by Sonatel and Orange in the context of the Data for Development Challenge. The data consist of call and text data for Senegal between January and December 2013 for 9,569,426 subscribers. For each call and text, the data contain the time, date and GPS location of the closest cell phone tower. The data is anonymized, with subscribers represented by a unique random ID.
The health district location of the last call of the day is assigned as the location for each day. For days with no call or text, the location of the day closest to the one missing is assigned. Movement is defined as a change in location from one health district to another between two consecutive days. Trips lasting only one day or between neighboring districts are removed in order to minimize noise from individuals only passing through a district on their way to a new location or individuals living on the border between two districts. The dependent variable is the square root of total number of people entering a district from another district in a given month.

Long term network links between districts are measured using a 10% sample of the census data from 2013. Long term migration between districts is measured based on the question asking where the individual was located one year ago. Five variables for average literacy in different languages (French, Arabic, Wolof, Pulaar, and Sereer) are used as a measure of ethnicity in order to account for ethnic networks. In addition, measures of population size and distance between health district centroids are used. These are to proxy for the fact that networks are often driven by proximity, and the trend of urbanization.

Two types of economic shocks are accounted for: agricultural and urban. I use EVI, the enhanced vegetation index, and rainfall as a monthly measure of agricultural activity to capture the agricultural seasons and crop level. EVI comes from monthly MODIS satellite pictures. Pixel values within the health district are averaged. Rainfall is from the NOAA Climate Prediction Center Rainfall Estimation Algorithm. Pixel values are averaged daily for the district and summed monthly to create a cumulative monthly measure. Nighttime lights data is used as a measure of monthly urban economic activity. These data are from the monthly NOAA Nighttime VIIRS DNB Composites for 2014.
The lights values for where someone is coming from and going to are also interacted. Both linear and quadratic terms are included for all lights variables.

There are two types of social shocks that I measure, religious holidays and pilgrimages, when large portions of the population travel to visit family, friends or an important religious site. Dummies are included for the two months in which the two largest Muslim holidays occur. The dummies are also interacted with the long term migration patterns, since often people that have migrated within the country go to their home village for these holidays. Dummy variables are included for the months in which the two largest pilgrimages occur, as well as dummies for the month and districts of the pilgrimages. While the majority of people travel to the two main sites, individuals will also travel to smaller important religious sites around the country during those pilgrimage days. Finally, the pilgrimage dummies are also interacted with the percent of the region following the religion that conducts the pilgrimage, based on a 10% sample of the 2002 census, in order to account for areas that are more likely to take part in the pilgrimage.

2.3 Empirical Analysis and Results

Running a regression of short term movement on the variables described in the previous section for a random subset of 3/4 of the district pairs results in an R² of 0.69 and a root mean square error of 8.82. Out of sample prediction on the quarter of districts left out results in an RMSE of 9.17. A scatter plot of total predicted people entering each district in a given month versus actual people entering based on the phone data in Figure 2.1 shows that the majority of green points lie along the blue 45° line. The axes from month to month demonstrate variation in the number of people entering, especially for a few key districts, which would be missed if only predicting aggregate annual movement.
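The holdout exercise described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the specification in the text: it uses a single hypothetical predictor `x` in place of the full set of network and shock variables, assigns whole district pairs to the training or holdout set, and measures the error on the square-root scale of the dependent variable. All function names are illustrative.

```python
import math
import random


def split_pairs(pairs, train_share=0.75, seed=0):
    """Randomly assign whole district pairs to the training or holdout set."""
    pairs = sorted(set(pairs))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_share)
    return set(pairs[:cut]), set(pairs[cut:])


def fit_line(xs, ys):
    """Ordinary least squares with an intercept and one regressor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def holdout_rmse(rows, seed=0):
    """rows: list of (origin, destination, month, x, people_entering)."""
    train_pairs, test_pairs = split_pairs([(r[0], r[1]) for r in rows], seed=seed)
    train = [r for r in rows if (r[0], r[1]) in train_pairs]
    test = [r for r in rows if (r[0], r[1]) in test_pairs]
    # The dependent variable is the square root of the number of people entering.
    intercept, slope = fit_line([r[3] for r in train], [math.sqrt(r[4]) for r in train])
    predicted = [intercept + slope * r[3] for r in test]
    return rmse([math.sqrt(r[4]) for r in test], predicted)
```

Splitting by district pair rather than by observation keeps all months of a pair on the same side of the split, so the holdout error reflects prediction for pairs the regression has not seen.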
Figure 2.2 shows average monthly district movement at the regional level. It demonstrates how the prediction algorithm is able to model the variation that occurs month to month and could have important implications for disease propagation.

The social shock variables and the ethnic network ones are specific to Senegal and would not apply in a different country context where short term movement prediction might be necessary. Therefore, I also conduct an analysis where I remove these country-specific variables. The R² decreases to 0.63, but as the orange triangles in Figure 2.1 show, the regression still predicts a large portion of the variation. The only points for which the prediction is now drastically different are the points for the district of Touba in January and December when the largest pilgrimages occurred that brought millions of people to the district. Since variables for these pilgrimages are no longer included, the number of people entering is underestimated. Taking the coefficients from this limited regression, Table 2.1 shows the standard deviation change in short term movement resulting from a standard deviation change in each variable. This helps illustrate which variables are most important for predicting short term movement. While long term network variables, especially proximity, play a critical role in determining short term movement patterns, predictors of urban and agricultural shocks are also important.

In order to test how well the predictions do in the case of studying the link between population movement and disease, I look at the relationship between short term movement and malaria. The original predicted values for short term movement are scaled by the malaria incidence in the district individuals are coming from in order to calculate a value for expected imported malaria incidence. Incidence data is available monthly at the district level from the National Malaria Control Program.
Figure 2.3 shows a scatter plot of the number of people entering based on mobile phone data scaled by incidence along with predicted people entering scaled by incidence. The blue 45° line shows that there exists a strong correspondence between the actual and predicted values.

Using the constructed expected imported incidence measure, I conduct a regression analysis to test its correlation with malaria incidence at the district level. I control for other characteristics such as rainfall, month and district fixed effects. This analysis is also done using the predicted movement scaled by incidence, predicted first using all variables and then limiting to non-country specific variables. Figure 2.4 shows the coefficients on people entering scaled by incidence along with the 95% confidence intervals. The figure shows that using the predicted values would provide a very similar estimate to the coefficient using measured short term movement. Wald tests for equality between the actual coefficient and the predicted movement ones, using all and non-country specific variables, both fail to reject the null hypothesis of equivalence (p values of 0.19 and 0.65 respectively). This demonstrates how even if short term movement data is not available, it could be possible to estimate it based on available datasets, and utilize it in studying disease dynamics.

2.4 Discussion and Conclusion

As populations become more mobile, short term movement will continue to have more important consequences for health. Understanding the movement and its consequences is critical for policy makers looking to prevent the spread of communicable disease. The emergence of mobile phone data as a source for measuring short term movement has revolutionized the way these questions are studied, but we are still far from having this data available and usable by researchers in all low-income countries. In the meantime, it is important to find ways to estimate these movements.
While past research has shown that it is possible to estimate the aggregate short term movement patterns using census migration data so that a ranking of movement can be reproduced that mimics the actual ranking of short term movement, it does not estimate the dynamics of short term movement. Especially for seasonal diseases, the differences in movement patterns from month to month are important to understand.

In this chapter, I show how combining a number of data sources that are available for the majority of low income countries can be utilized to estimate short term movement. Not only is it possible to estimate the movement, but the predictions can provide good estimates for the effects of short term movement. While the estimated values are more accurate when local factors such as specific holidays and ethnicities are incorporated into the analysis, using only variables that are common across countries produces good estimates of short term movement patterns as well. This suggests it might be possible to take this model to a country where mobile phone data is not available in order to estimate short term movement in this different context. While this should be tested first in the case of a different country where short term movement data is available and it is possible to test the accuracy of the predictions, this chapter provides the basis for such future analysis.
Figure 2.1: Actual versus Predicted Short Term Movement
Notes: People entering a health district in each month of 2013 as measured in the mobile phone data vs two types of predicted values: one where all static and shock variables are used, and a second where only those variables that are not country-specific are used.

Figure 2.2: Actual and Predicted Average Number of People Entering a District per Month in 2013, by Region in '000s

Figure 2.3: Actual versus Predicted Expected Imported Incidence of Malaria
Notes: Number of people entering a health district in a month measured in the mobile phone data scaled by malaria incidence in the district they enter from versus their predicted value scaled by incidence.

Figure 2.4: Coefficient on Expected Imported Malaria Cases
Notes: The left point is the coefficient from a regression of malaria cases on expected imported cases measured by the cell phone data. The middle point is the coefficient from a regression of malaria cases on expected imported cases measured from predicted movement based on all characteristics. The right point is the coefficient from a regression of malaria cases on expected imported cases measured from predicted movement based on characteristics not unique to Senegal.
Table 2.1: Standard Deviation Change in People Entering for a Standard Deviation Change in Each Variable

Long Term Networks                  Economic Shocks
District Distance       1.065       Lights To x Lights From       0.355
District Distance^2     0.822       Lights To                     0.194
Population From         0.308       Lights From                   0.145
Population To           0.343       Lights To^2
Migration Out           0.060       Lights From^2                 0.092
Migration In            0.068       EVI From x EVI To             0.107
Migration Out^2         0.021       EVI From                      0.082
Population To^2         0.090       EVI To                        0.089
Migration In^2          0.015       Lights To x Lights From^2
Population From^2                   Rain From                     0.011
                                    Rain To                       0.008

Values for the three rows left blank, in the order they appear in the source (row alignment not recoverable): 0.099, 0.045, 0.123.

Notes: Based on a regression of non-country specific variables.

Chapter 3

Bias in Tracking Movement Using Mobile Phone Data

3.1 Introduction

The last few years have seen a large rise in the use of mobile phone data for research. This is not surprising given the tremendous amount of information this data provides on location, movements and social networks that is arduous and at times impossible to collect using traditional surveys. This data has allowed us to study questions from the spread of disease (Buckee et al. 2013, Le Menach et al. 2011, Ruktanonchai et al. 2016, Wesolowski, Eagle, et al. 2012, Wesolowski, Metcalf, et al. 2015, Wesolowski, Qureshi, et al. 2015) to socioeconomic status (J. Blumenstock, Cadamuro, and On 2015), to migration and mobility patterns (J. E. Blumenstock 2012). Yet a central question in using mobile phone data is whether bias exists in the data. Especially as policy makers begin to use the analyses from these data to inform policies that could impact millions of people, it is critical to understand if there might be non-random measurement error leading to bias in the data that could impact the study results or conclusions.

While selection bias arising from who owns a phone and who uses the particular provider from which data are available is at least minimally discussed and at times addressed in most papers utilizing this data (Buckee et al.
2013), an important issue to consider is the potential correlation between the calls and the location of individuals. In particular, many papers use the mobile phone data to study moves between different parts of a country in order to understand for example the spread of a disease from a high to a low prevalence area (Le Menach et al. 2011, Wesolowski, Eagle, et al. 2012). For this purpose, a location is usually assigned to each day, with any missing days allocated to the location of the closest non-missing day. If the days with and without calls are non-random, this could introduce significant bias. For example, suppose a person always calls home when they arrive at their destination to let their family know they arrived safely, and they call every day while away, but are less likely to call once home. If a researcher were to only use the days with calls, ignoring the missing days, then too many days would be attributed to the locations individuals visit because they are more likely to call while away as compared to while in the home location. Additionally, if the person always calls upon visiting a new place, then assigning missing days based on the closest day with data would again attribute too many days to the place visited.

While potential bias is something that all researchers working with data must face, the specific error from missing data is difficult to address. Ideally, if data were available on a person's actual location (say from survey data) on each day, it would be possible to compare this to the data from the mobile phones in order to see whether there is error in the location generated from the mobile phone records. Yet this data is not usually available because privacy concerns prevent the linking of phone records to identifiable characteristics, and additionally, even if it were possible to collect survey data on a small number of people that agreed to participate, there could be bias based on the selection of those that choose to participate.
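The traveler example given earlier in this introduction can be made concrete with a small sketch of nearest-day interpolation. The scenario is hypothetical: a person is away from home only on days 10 to 12 of a 30-day window and calls on each of those days, but calls from home only on days 1 and 30. The function name `fill_nearest` and the rule that ties go to the earlier observed day are assumptions made for illustration.

```python
def fill_nearest(observed, first_day, last_day):
    """Assign every day the location of the closest day with a call or text.

    observed: dict mapping day -> location for days with at least one call or text.
    Ties between two equally close days go to the earlier day (an assumption).
    """
    days = sorted(observed)
    filled = {}
    for d in range(first_day, last_day + 1):
        nearest = min(days, key=lambda o: (abs(d - o), o))
        filled[d] = observed[nearest]
    return filled


# Hypothetical traveler: truly away on days 10-12 only, rarely calls from home.
observed = {1: "home", 10: "away", 11: "away", 12: "away", 30: "home"}
filled = fill_nearest(observed, 1, 30)

days_away_true = 3
days_away_calls_only = sum(1 for loc in observed.values() if loc == "away")
days_away_filled = sum(1 for loc in filled.values() if loc == "away")
print(days_away_true, days_away_calls_only, days_away_filled)
```

In this constructed case both approaches overstate time away: using only days with calls puts 3 of 5 observed days away from home, and nearest-day interpolation assigns 16 of 30 days to the visited location although the true figure is 3.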
This paper addresses this challenge by using the mobile phone data itself to determine if there is systematic error in the location where we identify a person to be. Specifically, the calls that I choose to make or that I receive from friends and family might be correlated with my location and movement patterns, but the calls that I receive from robo-callers or companies calling thousands of people potentially might not be correlated with my location and movement. I therefore pick out the callers that are making thousands of calls, which are unlikely to be correlated with the movements of the people being called, and use these to identify if there is measurement error in the data. In a way, this is the opposite of the cleaning described by Blondel, Decuyper, and Krings (2015), whereby researchers impose conditions to filter out the links that appear randomly in the network. I use these calls precisely because they do not represent any relationship and therefore, their timing should not correspond to actions taken by the recipients of these random calls and texts.

I use the calls from these accounts making thousands of calls, which I'll call robo-callers, to compare the distribution of locations based on all calls with the distribution of locations using only the robo-calls within a conditional logit framework. I find that there is little systematic error in the calls, but bias is introduced by researchers when they use the standard technique in the literature of interpolating missing days based on the closest day with a location. I show how this could be fixed theoretically using the outcome from the procedure used to detect the bias. I also discuss how missing days could be interpolated in different ways in order to mitigate some of the error, thereby providing a slightly different version of where individuals are located.

While the measurement error I find is small, it becomes larger for those with fewer days with calls in the data.
Especially as these data and results from analyses with the data begin influencing policy, it is important to understand potential measurement error that arises in tracking location based on phone calls. I specifically look at two potential policy applications: measuring population size at the administrative level and measuring importation of infectious disease, with a focus on malaria in Senegal. I find that at the annual level, the error is minimal and does not impact results, but there is more bias when looking at population numbers on a daily level as well as when measuring malaria importation. I show that depending on the focus of the analysis, it is important to give thought to how missing days are interpolated and to potentially show results with missing days treated in different ways in order to demonstrate the sensitivity of the analysis.

The closest paper to this research is Ranjan et al. (2012), in which the authors explore potential bias in measuring movement with mobile phone CDR data by comparing it to data access data (which does not always require human initiation). This work is focused in a developed country setting where smartphone use penetration is much higher, as compared to a developing country context where this type of analysis would not be possible and where we might expect to see different correlations between calls and locations. The paper is also focused on within-day movements at a shorter distance, using only one month of data, while here I am interested in studying the long-distance mobility, which is the most likely to have impact on something like spread of infectious disease. Similarly to the research here, the authors do find bias in the data, though they do not apply the results to study the impact for specific policy analyses, instead showing bias in different metrics of mobility.

The paper starts by describing the data. It then goes on to explain the model and analysis in sections 3 and 4.
Robustness checks are presented in section 5, and it concludes with section 6.

3.2 Data

For the analysis I use a dataset that is made available by Sonatel and Orange in the context of the Data for Development Challenge (Montjoye et al. 2014). The full dataset consists of 15 billion call and text records for Senegal between January 1, 2013 and December 31, 2013 for all of Sonatel's user base of around 9.5 million SIMs. Anonymous, random IDs are provided in order to track the same SIM over time, but there is no identifying information on the individuals. There are data on all calls and texts made or received by a SIM card, their time, date and location of the closest cell phone tower, as well as the anonymous ID of the other SIM that is part of the two-way call/text interaction.

I first define a group of users as the robo-callers that can help me to determine an unbiased measure of the location of the users they call. I define these by first taking the 1% of users with the highest number of annual calls or texts made or received from the full dataset of 9.5 million users. This is equivalent to all users with over 15,950 calls or texts in 2013. From these, I then took the top 5 percent with the largest number of individual SIM contacts that they make calls to or send texts to (I do not include users that call them or send texts to them). These SIMs all have over 1394 different contacts that they reach out to at least once during the course of the year. In total, this gives me 4,580 SIMs. The calls and texts that they make I call random pings. I only use the calls or texts made by the robo-callers and not any calls or texts that the robo-callers receive. While the robo-callers are defined based on all SIMs in the data, the analysis is conducted on a random sample of 500,000 SIMs.
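The two-step selection of robo-callers described above can be sketched as follows. This is an illustration under assumptions, not the code used for the analysis: records are taken to be simple (caller, receiver) pairs, the cutoffs are applied as rank shares rather than as the call and contact thresholds reported in the text, and the names `select_robo_callers` and `random_pings` are hypothetical.

```python
from collections import defaultdict


def select_robo_callers(records, volume_share=0.01, contact_share=0.05):
    """Pick SIMs with very high volume and very many distinct outgoing contacts.

    records: iterable of (caller_id, receiver_id), one per call or text.
    """
    volume = defaultdict(int)        # calls/texts made or received per SIM
    out_contacts = defaultdict(set)  # distinct SIMs each SIM calls or texts
    for caller, receiver in records:
        volume[caller] += 1
        volume[receiver] += 1
        out_contacts[caller].add(receiver)

    # Step 1: top share of SIMs by total calls or texts made or received.
    n_top = max(1, int(len(volume) * volume_share))
    by_volume = sorted(volume, key=lambda s: (-volume[s], s))[:n_top]

    # Step 2: among those, top share by number of distinct outgoing contacts.
    n_robo = max(1, int(len(by_volume) * contact_share))
    ranked = sorted(by_volume, key=lambda s: (-len(out_contacts[s]), s))
    return set(ranked[:n_robo])


def random_pings(records, robo_callers):
    """Keep only calls or texts made by robo-callers (not those they receive)."""
    return [(c, r) for c, r in records if c in robo_callers]
```

Counting only outgoing contacts in the second step is what separates mass callers from heavy users who mostly talk to a small circle; the exact shares are parameters that can be varied in a robustness check.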
Throughout the paper, I will refer to these SIMs as individuals, though I do not actually have data on whether a SIM corresponds to a unique individual.[1] Of this sample of 500,000, there are 140,374 individuals that receive at least one ping from the robo-callers defined earlier.

[1] Based on the Listening to Senegal survey conducted in 2014, of those individuals that do not own a phone, 79% state that they own a SIM card, suggesting that while physical phones might be shared, SIM cards are more likely to represent one person. Additionally, in the same survey, around 10% of individuals cite using more than one mobile phone, which would be the main reason that an individual would have multiple SIM cards from the same provider, since there is no benefit to having more than one SIM from the same provider. Therefore, this is suggestive that a SIM is likely to represent a single individual.

The data provides the GPS locations of the mobile phone towers, which I can use to assign a location for each call or text based on the closest mobile phone tower. While the location for the tower is provided at a very detailed level, I aggregate this up to the country administrative units. I do this because I am interested in looking at the error in location when individuals travel longer distances, rather than the within-day moves between home and work for example. I use the top administrative level to define location, consisting of 14 regions. Since I am looking at these longer distance moves, for each individual in the sample, I define a location for each day that they make at least one call or text based on the last call or text of the day.[2] I then separately define their ping location based on the location of the last ping of the day that they receive. For the individuals that receive at least one ping, on average they have around 29 pings annually. They also, on average, have around 266 days with data based on all of their calls and 11 days based on the pings alone.
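The two daily location measures described above can be sketched in a few lines. The record layout (SIM, day, time, region, ping flag) and the function name `last_location_by_day` are assumptions made for illustration; the regions in the example are used only as labels.

```python
def last_location_by_day(events, pings_only=False):
    """Region of the last event of each day for each SIM.

    events: iterable of (sim_id, day, time, region, is_ping).
    pings_only: if True, use only events that are pings from robo-callers.
    """
    last = {}
    for sim, day, time, region, is_ping in events:
        if pings_only and not is_ping:
            continue
        key = (sim, day)
        # Keep the event with the latest time stamp seen so far for this SIM-day.
        if key not in last or time >= last[key][0]:
            last[key] = (time, region)
    return {key: region for key, (_, region) in last.items()}


events = [
    ("sim_1", 1, 9.0, "Dakar", False),
    ("sim_1", 1, 12.0, "Dakar", True),
    ("sim_1", 1, 20.5, "Thies", False),
    ("sim_1", 2, 8.0, "Thies", True),
]
call_location = last_location_by_day(events)
ping_location = last_location_by_day(events, pings_only=True)
```

On day 1 in this example the two measures disagree, because the last call of the day is made from a different region than the last ping received; days without any ping simply have no ping location.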
The individuals that receive a ping are not a random sample of the population. Looking at their characteristics shown in Table 3.1, they have on average almost 4 times as many calls or texts as those that do not receive a ping.[3] They have fewer days of data missing, they tend to be seen in more districts on average, and they have over 3 times as many contacts on average. Therefore, I will only study the group of individuals that receive a ping, and I do not compare those with and without pings. I also include a robustness check where I use a different definition of pings and therefore have a slightly different population of individuals in order to see how dependent the results are on the definition of pings.

[2] This is one of the standard methods used in the literature to define a daily location (Ruktanonchai et al. 2016).

[3] Since some of them actually have over 15,960 calls or texts, which was part of the definition I used to determine the robo-callers, I include a robustness check later where I remove SIMs from the data that have over 15,000 calls or texts.

Figure 3.1 shows a comparison of the raw distribution across regions for all calls and for pings. The solid bars represent the percent of all person days spent in each region during 2013. The striped bars show the percent of all days with a ping spent in each region during 2013. There does seem to be a discrepancy, with a much higher percentage of days being spent in Dakar based on the pings, as compared to other locations such as Thies, Saint Louis, and Ziguinchor, where we see a lower percentage of days based on the pings. This is suggestive of a difference between the calls and pings, which will be explored at the individual level in the rest of the analysis.

3.3 Model

The first goal of this analysis is to determine whether or not bias exists in the data related to correlation between the calls and the locations of an individual.
Since I am studying this at the administrative geographic level, individuals have a fixed set of locations they can be in, and there is some probability that an individual will be seen in a given region. If a person made or received a call or text on every single day of the year, we would know their location on each day based on the last call or text of the day. The probability of finding that individual in a location in 2013 would then be the number of days in that location divided by 365. Yet, as the summary statistics show, the majority of people do not have a call or text on every day, so there are some days where we do not know their location. We are then interested in whether the days with calls correctly represent where an individual is located over the year.

One way to think about this problem is the following: if a person has a probability of 1/5 of being in location x and a probability of 4/5 of being in location y, and receives 10 random calls during the year, then on average 2 of them should be in location x and 8 in location y. Similarly, if they receive many more calls and the calls are uncorrelated with the locations, on average one fifth of the calls should be in x and four fifths in y. The underlying framework for this exercise is therefore to compare the probability of finding an individual in a given location using their calling data with the probability of finding them there using only the pings, which are assumed to be random and uncorrelated with a person's movement patterns and therefore to represent the true underlying probability. Under the null hypothesis that there is no bias in the data, the probability distribution based on calls and the probability distribution based on pings should be the same.
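To make the x/y example concrete, the short sketch below computes the share of call-days observed in each location when the chance of calling on a given day may depend on where the person is. The function name and the calling probabilities are hypothetical and chosen only for illustration; the 1/5 and 4/5 location shares are those from the example above.

```python
def observed_share(true_share, call_prob):
    """Share of call-days per location when calling may depend on location.

    true_share: {location: share of days actually spent there}
    call_prob:  {location: probability of a call on a day spent there}
    """
    weights = {loc: true_share[loc] * call_prob[loc] for loc in true_share}
    total = sum(weights.values())
    return {loc: w / total for loc, w in weights.items()}

true_share = {"x": 0.2, "y": 0.8}

# Calls uncorrelated with location: the call-days reproduce the true shares.
unbiased = observed_share(true_share, {"x": 0.5, "y": 0.5})

# Hypothetical case where the person calls less often while in x:
# location x is then under-represented among the call-days.
biased = observed_share(true_share, {"x": 0.3, "y": 0.9})
```

In the second case the observed share of x is 0.06/0.78, or roughly 8% instead of 20%, which is the kind of discrepancy the comparison with pings is meant to detect.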
Formally, suppose each individual $i$ has a set $J$ of possible locations $j$. There is some probability that a call or text $i$ makes or receives will be from $j$ (implying that the person is located in $j$), and some probability that a ping $i$ receives will be in $j$. Assuming a conditional logit framework, if the distribution under the pings is $\gamma$ and the distribution under the calls is $\delta$, we have:

$$P(\mathit{Call}_i \in j) = \frac{e^{\delta_{ij}}}{\sum_{J} e^{\delta_{ij}}} \tag{3.1}$$

$$P(\mathit{Ping}_i \in j) = \frac{e^{\gamma_{ij}}}{\sum_{J} e^{\gamma_{ij}}} \tag{3.2}$$

If the distributions are the same, then $\delta_{ij} = \gamma_{ij}$. I do not have the full data for each person (if I did, the bias studied in this paper would not be an issue), but I do have a large number of calls and days per person. Therefore, I approximate the probability that a call/text made or received by $i$ is in location $j$ as $Q_{ij}$, the proportion of all of $i$'s calls/texts that occur in $j$. That means:

$$Q_{ij} = P(\mathit{Call}_i \in j) = \frac{e^{\delta_{ij}}}{\sum_{J} e^{\delta_{ij}}} \;\longrightarrow\; \ln(Q_{ij}) = \delta_{ij} - \ln\Big(\sum_{J} e^{\delta_{ij}}\Big) \tag{3.3}$$

The last term is constant across locations for a given individual, so it cancels out of the conditional logit probabilities; this implies that we can take $\ln(Q_{ij}) = \delta_{ij}$. It is therefore possible to test whether the two distributions are equal by using a conditional logit model to calculate the probability that $\mathit{Ping}_i \in j$ with $\ln(Q_{ij})$ as the explanatory variable on the right hand side:

$$P(\mathit{Ping}_i \in j) = \frac{e^{\beta \ln(Q_{ij})}}{\sum_{J} e^{\beta \ln(Q_{ij})}} \tag{3.4}$$

If the two distributions are equal, then $\beta = 1$. If $\beta$ is significantly different from 1, the two distributions are significantly different from each other, and, assuming the distribution measured by the pings is the true, unbiased distribution, this implies there is systematic error in the distribution of locations measured by the calls. This test statistic can pick up systematic bias in the data, but it cannot pick up bias in individuals' particular distributions if that bias is random across the population.
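As a rough illustration of equation (3.4), the sketch below estimates $\beta$ by maximum likelihood with a one-parameter Newton-Raphson routine on simulated data. It is not the estimation code behind the tables: the function name, the simulated call shares and the single ping per simulated individual are assumptions made for the example, and standard errors here are not clustered. Pings are drawn from the call shares themselves, so the true $\beta$ in the simulation is 1.

```python
import math
import random

def fit_beta(log_q, chosen, iters=25):
    """Newton-Raphson MLE of beta in P(j) = Q_j**beta / sum_k Q_k**beta.

    log_q:  list of per-observation lists of ln(Q) over the locations
    chosen: index of the location where each ping was observed
    """
    beta = 0.0
    info = 0.0
    for _ in range(iters):
        grad, info = 0.0, 0.0
        for x, j in zip(log_q, chosen):
            m = max(beta * v for v in x)          # for numerical stability
            w = [math.exp(beta * v - m) for v in x]
            s = sum(w)
            p = [wi / s for wi in w]
            mean = sum(pi * v for pi, v in zip(p, x))
            grad += x[j] - mean                   # score contribution
            info += sum(pi * (v - mean) ** 2 for pi, v in zip(p, x))
        beta += grad / info
    return beta, 1.0 / math.sqrt(info)            # estimate, approximate s.e.

# Simulated data (hypothetical): 5,000 individuals, 3 locations each.
rng = random.Random(0)
log_q, chosen = [], []
for _ in range(5000):
    raw = [rng.random() + 0.05 for _ in range(3)]
    q = [r / sum(raw) for r in raw]
    log_q.append([math.log(v) for v in q])
    chosen.append(rng.choices(range(3), weights=q)[0])  # ping drawn from Q

beta_hat, se = fit_beta(log_q, chosen)
```

Because the simulated pings follow the call shares exactly, `beta_hat` should land close to 1; drawing pings from a distorted version of `q` instead would push the estimate away from 1.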
The test statistic therefore helps reveal systematic bias, which is useful to understand for aggregate analyses, but it is not necessarily helpful for individual-level analyses. I study the distributions of locations based on the full annual data available per SIM.

3.4 Analysis

3.4.1 Testing the Independence of the Pings

The procedure to test for measurement error in the calls made and received by individuals rests on a key assumption: that the pings individuals receive from the SIMs I have identified as robo-callers are indeed independent of the individuals' choices of location. Since I do not actually know who these callers are or the reasons for the calls they make and texts they send, it is difficult to know how accurate this assertion is. I therefore try to test it using the data.

There are a number of individuals that have a call or a text on every single day that they are in the dataset.4 For these people there should not be any error in the location where we see them day to day, since data are available for every day of the year; the issue of individuals choosing specific days to make a call and appear in the data based on their location therefore does not arise. Thus, I can test whether there is bias in the pings by conducting the conditional logit exercise only for the sample that has a call on every single day. If the pings are biased, so that the likelihood of receiving a ping is correlated with location, then the distribution of pings will differ from the actual probabilities of an individual being in a location. Column 1 of Table 3.2 shows the results for the sample of individuals with calls on all days. The table shows the coefficient on the log probability values from using all calls, as well as the p-value from a test of the null hypothesis that the coefficient is equal to 1.
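For readers who want to see how a test of the null hypothesis that the coefficient equals 1 works, a simple two-sided Wald (z) test can be sketched as below. This is an assumption about the form of the test, not a statement of the exact routine used for the tables, and the function name is hypothetical. The inputs are the rounded coefficient and standard error reported in column 1 of Table 3.2, so the result only approximates the reported p-value.

```python
import math

def wald_p_value(coef, se, null=1.0):
    """Two-sided z-test of H0: coefficient == null, using a normal approximation."""
    z = (coef - null) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Rounded values from column 1 of Table 3.2 (all days).
z, p = wald_p_value(0.980, 0.00523)
```

With these inputs the statistic is about -3.8, and the p-value is on the order of 0.0001, in line with the value reported in the table.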
Column 1 indicates bias in the location detected by pings, suggesting that the pings are not uncorrelated with an individual's location. One obvious difference to expect between pings and regular calls is that, since pings are by definition made by what are assumed to be businesses, there should in general be fewer pings on weekends, when many businesses are closed. This is evident in the data: around 14 percent of all calls occur on each day of the week, while only around 12% of pings occur on Saturdays and around 10% on Sundays. Once I remove all weekend days from the analysis and compare calls and pings only on weekdays, Column 2 of Table 3.2 shows that there is no longer a bias in the data: based on the t-test conducted, the coefficient is no longer significantly different from 1. This suggests that as long as the analysis is conducted for weekdays, pings should provide an unbiased representation of locations.5

4 Being in the dataset is defined as the time period from the first day in 2013 a call or text is made or received until the last day in 2013 a call or text is made or received.

5 Figure 3.A.1 in the Appendix shows that even after removing calls and pings on weekends, the difference in percent of person days in each region remains consistent.

3.4.2 Results for Days with Calls

I conduct the test for everyone in the sample receiving a ping. I also break down the population into those that are present for less than 25% of their time in the data, 25-50% of their time, 50-75% of their time, and over 75% of their time. Especially if we believe that someone's calling pattern could be correlated with their location, the patterns could vary between those that rarely use their phone and those that use it more often. The top panel of Table 3.3 shows the results.
There does seem to be some measurement error, so that the distribution of all calls is significantly different from the distribution of pings, though its size is relatively small. The p-value of the t-test for whether the coefficient is significantly different from 1 is 0.000. Splitting the sample by time in the data reveals some differences: there is no detectable error among those with data on less than 75% of days, but we do see some error among individuals with data on more than 75% of days. The appendix also includes results from analyses where the days with calls are split into those with pings and those without, and only the location from days without pings is compared to the location from days with pings.

3.4.3 Results After Interpolation of Missing Days

In the previous subsection, I compared only the specific days with a call to the days with pings. In research using mobile phone data, however, the location of missing days is usually interpolated between the first and last days that an individual is in the data.6 I therefore redo the test by first interpolating missing days for individuals and then recalculating the probabilities of being in a given region based on the data with interpolated missing days.

6 This means that if someone comes into the data in April and the last observation of the person in the dataset is in October, the months before April are not interpolated, but if the person has days missing in May, their location is interpolated based on the location of the closest day.

Panel b of Table 3.3 shows the results of the analysis when comparing the distribution of locations in the interpolated data to the pings. The table shows that interpolation actually increases the bias. For the individuals with fewer days of calls, interpolation introduces very significant systematic error.
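The closest-day interpolation discussed above can be sketched as follows. This is a minimal illustration under stated assumptions: days are represented as integers, the function name is hypothetical, and ties between two equally distant observed days are broken in favor of the earlier day, which is a choice made here and not something the text specifies. Days before the first or after the last observation are left unfilled.

```python
def interpolate_nearest(observed):
    """Fill missing days between the first and last observed day.

    observed: {day_number: region} for days with at least one call or text.
    Each missing day takes the region of the closest observed day
    (ties go to the earlier day; an assumption of this sketch).
    """
    days = sorted(observed)
    filled = {}
    for prev, nxt in zip(days, days[1:]):
        filled[prev] = observed[prev]
        for d in range(prev + 1, nxt):
            nearer = prev if d - prev <= nxt - d else nxt
            filled[d] = observed[nearer]
    filled[days[-1]] = observed[days[-1]]
    return filled

# Made-up example: observed on days 1, 5, 6 and 10.
observed = {1: "Dakar", 5: "Thies", 6: "Thies", 10: "Dakar"}
filled = interpolate_nearest(observed)
```

The shares of days per region computed from `filled` would then replace the call-based shares on the right hand side of the conditional logit.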
For those with few days missing, the interpolation seems to help bring the coefficient closer to 1, though it is still significantly different. Taken together, the two panels of Table 3.3 suggest that there is some error, especially at the upper end of the distribution of time in the data, whereby the calls do seem to be correlated with the location of the individual, but that interpolating the missing days based on the location of the closest day significantly increases the error. If interpolation were instead done in a way that took account of the correct calling patterns, it would be possible to mitigate, or at the very least not increase, the error from the calling pattern.

It is actually possible to use the procedure for testing for error to help correct the error. Using the coefficient calculated by the procedure, a corrected probability can be calculated by raising the probability based on the interpolated data to the power of the coefficient from the conditional logit model.7 In order to make this correction, I have to make the strong assumption that the bias is systematic and homogeneous among SIMs, so that β is the same for everyone. Table 3.4 shows the results of running the same conditional logit models as in the bottom panel of Table 3.3, except that the right-hand-side variable is now the new probability adjusted by the coefficient. After the re-weighting, the adjusted distribution is no longer significantly different from the distribution of pings for those with under 75% of data. For those with over 75% of data there is a slight significant difference that arises from rounding.8

7 The probabilities for all the regions are then adjusted to sum to 1 for each individual.
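The correction described above follows from the model: since $e^{\beta \ln(Q_{ij})} = Q_{ij}^{\beta}$, the fitted ping probability is $Q_{ij}^{\beta} / \sum_{J} Q_{ij}^{\beta}$. A minimal sketch of that re-weighting is given below. The function name and the example shares are hypothetical; the coefficient 0.612 is the one reported in panel b of Table 3.3 for individuals with data on fewer than 25% of days.

```python
def reweight(probs, beta):
    """Raise each location probability to the power beta, then renormalize."""
    powered = {loc: p ** beta for loc, p in probs.items()}
    total = sum(powered.values())
    return {loc: v / total for loc, v in powered.items()}

# Hypothetical interpolated shares for one SIM.
interpolated = {"Dakar": 0.9, "Thies": 0.1}
corrected = reweight(interpolated, 0.612)
```

With a coefficient below 1 the distribution is flattened: in this example the shares move to roughly 0.79 and 0.21, shifting weight from the most visited location toward the less visited one. A coefficient of exactly 1 leaves the shares unchanged.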
8 The procedure was done by assigning the number of missing days to locations that the SIM has visited in such a way so as to get to the theoretically correct probability. For this reason, the time spent in locations is rounded to the nearest day, leading to the slight significant difference in the coefficient for individuals with over 75% of time in the data.

Having calculated the theoretical probabilities that would help account for any bias based on correlation between calling and locations, it is possible to go back to the data and interpolate in a way that gets the distribution as close as possible to those theoretical probabilities. I use a procedure where I attribute missing days to either the location of the last day with a call before the missing day or the first day with a call after the missing day in such a way that I minimize the squared error between the theoretical probability and the probability after assignment of missing days. In this way, I do not randomly distribute locations to the missing days without knowing if the individual could have plausibly been in that location at that time, but rely on the presumption in the literature that the missing day would have been spent in one of the two locations seen before and after the missing day. This also means that if the location is the same before and after the missing day, then this location is automatically assigned to the missing observation.

Table 3.5 shows the results from filling in missing days optimally in order to get as close to the theoretical probability as possible. Reassigning missing days based on the theoretical probability that would mitigate error does get us closer to coefficients of 1 across the board, but does not do well in eliminating the error for those with less than 50% of data.

If a researcher did not have access to the full network in order to calculate pings and the exact size of the error that might be present, but assumes that the error I find in this dataset reflects some universal calling behavior, then it is possible to think about a strategic way of allocating missing days without information on the theoretical probability. For example, the error suggests that allocating missing days through the current closest-day procedure allocates too much time to the locations where people spend the most time and not enough days to the locations they only visit for a shorter period. A scenario that would produce this pattern is one where migrants living in Dakar usually travel to another region to visit family. When they arrive to visit their families, they don't have to call anyone, but whenever they go back to their new home, they immediately call their families to say they arrived safely. If that is the case, then missing days could be allocated to the location with the lower current probability, thus down-weighting places with high probability.

The results of using the new interpolation based on lowest probability are shown in the middle panel of Table 3.5. For those with over 50% of time with data, assigning the few missing days to the location with the lower probability actually leads to a complete mitigation of the error on average. For those with less than 50% of time with data, though, using the location with the lower probability leads to a much larger bias. This suggests that for individuals with few missing days, this methodology, which does not require knowledge of the specific theoretical probability that one would aim to achieve, can fully mitigate the error on average. For those with fewer days of data, though, this cannot be done.
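The two gap-filling strategies discussed above can be sketched with a shared helper that assigns each missing day to either the location before or the location after the gap. Everything here is illustrative: the function names, the integer day index and the example data are assumptions. In particular, `target_rule` is a simple greedy heuristic that moves day counts toward a target share one day at a time; it stands in for, and is not, the squared-error minimization used in the analysis.

```python
from collections import Counter

def fill_gaps(observed, choose):
    """Assign each missing day to the location before or after its gap.

    observed: {day_number: region}; choose(before, after, counts, total)
    picks one of the two candidate regions when they differ.
    """
    days = sorted(observed)
    total = days[-1] - days[0] + 1          # days between first and last observation
    counts = Counter(observed.values())     # running day counts per region
    filled = dict(observed)
    for prev, nxt in zip(days, days[1:]):
        before, after = observed[prev], observed[nxt]
        for d in range(prev + 1, nxt):
            loc = before if before == after else choose(before, after, counts, total)
            filled[d] = loc
            counts[loc] += 1
    return filled

def lower_probability_rule(call_probs):
    """Pick the candidate with the lower call-based probability."""
    def choose(before, after, counts, total):
        return before if call_probs[before] <= call_probs[after] else after
    return choose

def target_rule(target):
    """Greedy heuristic: pick the candidate furthest below its target share."""
    def choose(before, after, counts, total):
        def shortfall(loc):
            return target[loc] - counts[loc] / total
        return before if shortfall(before) >= shortfall(after) else after
    return choose

# Made-up example: days 4 and 5 are missing between Dakar (day 3) and Thies (day 6).
observed = {1: "Dakar", 2: "Dakar", 3: "Dakar", 6: "Thies", 7: "Dakar"}
low = fill_gaps(observed, lower_probability_rule({"Dakar": 0.8, "Thies": 0.2}))
fit = fill_gaps(observed, target_rule({"Dakar": 0.7, "Thies": 0.3}))
```

In this example the lower-probability rule gives both missing days to Thies, while the target-based heuristic splits them so that the final shares (5 of 7 days in Dakar) sit close to the 0.7 and 0.3 targets. Only the resulting day counts matter for the location distribution, not the order in which the days are assigned within a gap.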
Additionally, it is important to highlight again that this procedure answers the question of whether there is error on average, but it is also possible for particular individuals to have bias in the distribution of their locations that is not detected.

The final option is, instead of allocating missing days to a specific location, to assign the probabilities to the missing day. This methodology is a bit more agnostic because it makes no assumptions about where an individual is actually located. Panel c of Table 3.5 shows the results when this is done. There is a small amount of error when this methodology is used, but in general it is extremely small, and it corrects the large error generated for those with data on less than 50% of days.

3.5 Policy Applications

In this section I look at how defining the location of missing days impacts analyses that could have important policy implications. For many analyses, it is important to know the number of individuals located in specific areas and their movement patterns between locations. I first look at the spread of individuals across different regions. Figure 3.2 shows the number of individuals in each region annually based on the sample of individuals with pings. The annual location is the location where each SIM spends the most days. It is calculated both by filling in missing days using the traditional strategy of assigning the location of the closest day with data and by filling in with the probability of an individual being in each region. We see that in this type of annual analysis, the assignment of missing days makes very little difference, which is helpful because it alleviates the concern of large measurement error in this type of analysis.
The blue line shows the difference between the bars as a percent of the total population measured in the region according to the traditional method; the difference is around 0 for most districts and only goes up to around 2.5% for one district. Therefore, it is not necessary to worry as much about how missing days are filled in for this type of annual analysis.

In Figure 3.3, I look at two specific regions and, rather than the annual population, I look at the daily population numbers calculated using either the traditional method or the alternative of assigning the probability from the earlier exercise. On a daily basis there is a lot more variation, though it is not necessarily consistent. For Sedhiou in the top panel, the traditional method sometimes underestimates and sometimes overestimates the daily population. For Tambacounda, the population seems to be consistently overestimated. So for daily analysis of locations, some bias could be introduced when missing days are filled in using the traditional method. This could be especially important in the case of a natural disaster, when policy makers would want to know exactly how many people are in the affected location.

Finally, in Figure 3.4, I explicitly look at how different methods of filling in missing days can impact the measurement of expected malaria cases in a given region. I calculate this by first summing the daily probability of malaria infection for each SIM for each month based on the location for each day, where missing days have been filled in using either the typical strategy or the probabilities calculated from the exercise earlier. I then assign each individual a location for that month based on where the most days are spent and sum the malaria probability across individuals per region for each month.
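The monthly calculation just described can be sketched as below. This is an illustration under stated assumptions, not the code behind Figure 3.4: each day is represented as a probability distribution over regions (a day with a known or hard-assigned location puts probability 1 on one region), the daily infection risks are made-up numbers, the function name is hypothetical, and "most days" is interpreted as the largest expected number of days.

```python
from collections import defaultdict

def expected_cases(sim_days, daily_risk):
    """Sum each SIM's monthly infection probability into its modal region.

    sim_days:   {sim_id: [ {region: probability of being there} per day ]}
    daily_risk: {region: daily probability of malaria infection}
    """
    cases = defaultdict(float)
    for days in sim_days.values():
        time_in = defaultdict(float)
        risk = 0.0
        for dist in days:
            for region, p in dist.items():
                time_in[region] += p               # expected days in region
                risk += p * daily_risk[region]     # expected infection risk
        home = max(time_in, key=time_in.get)       # region with the most days
        cases[home] += risk
    return dict(cases)

# Hypothetical daily risks and a three-day "month" for one SIM.
daily_risk = {"Dakar": 0.001, "Tambacounda": 0.01}
hard = {"A": [{"Dakar": 1.0}, {"Dakar": 1.0}, {"Tambacounda": 1.0}]}
soft = {"A": [{"Dakar": 1.0}, {"Dakar": 1.0}, {"Dakar": 0.5, "Tambacounda": 0.5}]}

cases_hard = expected_cases(hard, daily_risk)
cases_soft = expected_cases(soft, daily_risk)
```

In this toy example the third day is either assigned entirely to the higher-risk region or split evenly between the two, which changes the expected cases attributed to Dakar from 0.012 to 0.0075 and shows how the treatment of a single missing day can move the total.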
While this is not the typical method for calculating risk, assigning probabilities to the days does not allow for the calculation of transition risk (an individual going from location a to location b will bring some risk of malaria to b based on the incidence in a and the amount of time spent in a), since missing days are not assigned a set location but instead a probability of where the individual could be. Nevertheless, this provides a measure of how different the calculation of expected malaria might be depending on how missing days are treated.

In the upper panel of Figure 3.4, for Saint Louis, there are in general very few cases of malaria measured in the sample, but the gap between the number of cases measured using the probability assignment and the number measured using the traditional method is large as a percent of total cases in that region. Therefore, if researchers are focused on these very low malaria districts, how missing days are assigned could make a relatively large difference in the measurement of expected malaria and imported malaria. In panel b of Figure 3.4, we see that although the size of the difference in malaria cases measured using the two methodologies is relatively similar to Saint Louis, as a percent of the total case load in the region the difference is very small. Therefore, in an analysis of this region, how the missing days are assigned does not necessarily make as large a difference. This shows that, depending on the type of analysis a researcher is doing, it is important to consider systematic measurement error in classifying locations and movement, because it could impact the results of the analysis, and therefore the policy implications.

3.6 Robustness Checks

The sample of individuals that receive pings includes some that could potentially be robo-callers themselves, since there are some in the sample with over 15,000 calls or texts in 2013, which is the first step that I use in defining the robo-callers.
Therefore, I conduct a robustness check where I remove these outliers that make a large number of calls/texts, since for them the pings are less likely to be random and more likely to be associated with the dynamics of the business (for example a company calling or receiving calls from its suppliers). Table 3.6 shows results where those in the sample with over 15,000 calls have been removed. Comparing this to the top panel of Table 3.3, we see that the outliers neither affect the size of the coefficient significantly nor change the significance level of the null hypothesis test. This is not necessarily surprising because, if we imagine these outliers to be businesses, they are unlikely to be changing location; the majority of their calls and pings will therefore be from the same location, making measurement error in location unlikely.

The analysis so far has been conducted using the definition of pings based on SIMs that have the most calls and the most contacts. Nevertheless, there are different ways in which robo-calls could be defined, and it is possible to examine how the analysis changes with a new definition. The second definition I use again takes the top 1% of callers (those with over 15,950 calls or texts in 2013), but rather than basing the robo-callers on contacts, I take those individuals that make over 90% of their calls from a single district location.9 The goal of this is to pick out businesses making calls, with the assumption that businesses would be more likely to remain in the same location.10 There are 37,070 robo-callers under this new definition. Comparing the two groups of robo-callers, there are a total of 1,168 that fall under both definitions, while 3,412 are only in the group defined by number of contacts and 35,902 are only in the group defined by location.

9 Districts are smaller geographic areas than regions, with 76 health districts in Senegal.
10 90% is used rather than 100% since, especially for a business that is located near the border of two districts, it could be possible that a call would get redirected to a tower slightly further away that might be in the neighboring district if traffic to the closest tower in the same district is very heavy, or if that tower is down for some reason.

A total of 230,790 individuals out of the 500,000 receive a ping from either one type of robo-caller or the other. 115,506 individuals are included in both datasets, representing around 50% of the individuals in either set. Around 25 thousand individuals are only in the dataset that is defined based on the number of contacts and around 90,500 individuals are only in the dataset that is defined based on the location from where calls are made.

Table 3.7 shows the analysis of just those individuals that are present for every day of the year, comparing all days with only weekdays, to check for bias in the pings. Similarly to Table 3.2, we see that while bias exists when looking at all of the data, the bias is mitigated when weekend days are removed and the comparison is only done for weekdays. Additionally, the direction of the bias is similar between the two different definitions of pings, and the sizes of the coefficients are also comparable.

Panel a of Table 3.8 shows the results of the procedure testing for bias. Based on the new definition of pings, the distribution of locations from calls is significantly different from the distribution of locations from pings. The direction of the error is the same as with the previous definition of pings, and the sizes of the coefficients are comparable, though the coefficient with the new definition is closer to 1 and only significantly different from 1 at the 0.1 level. The reason for this can be seen when the analysis is broken out by time in the data. The coefficients of those having less than 75% of time in the data go in the opposite direction of the coefficient on those having over 75% of time in the data; therefore, putting all of them together means that on average we get close to a coefficient of 1. That change in the sign of the bias was seen in Table 3.3 as well, although here it seems to occur at a higher level of time in the data, for those with over 75% of time in the data, while earlier the switch happens around 50% of time in the data.

Panel b of Table 3.8 presents the analysis when comparing the distribution of locations from pings with the distribution of locations after interpolating missing days. Similarly to Table 3.3, the bias switches direction for those with fewer days of data. We also see that for those with over 75% of days of data, the interpolation helps to mitigate the bias, which was seen in the main results. This analysis shows that even under different definitions of robo-calls and random pings, systematic error is consistently detected, and it is similar in direction and size.

3.7 Conclusion

The paper finds some bias that is exacerbated by the standard practice in mobile phone data research of interpolating missing days based on the location of the closest day. Given the large number of papers that use the location from mobile phone data to study many different questions, such as the spread of disease or transportation, this error could potentially impact the results and conclusions in these papers. I present several strategies for partially mitigating the bias and reassigning missing days. These strategies could be used to create different scenarios to see how conclusions are affected by adjusting the locations of individuals.
I also show that the impact of the error depends on the type of policy analysis being conducted: it does not affect analyses conducted at the annual level, but does seem to matter more for analyses at the daily and monthly level. Specifically for studying the spread of malaria, we see that in some of the low malaria districts, the way that missing days are filled in for individuals could make a large difference in the analysis conducted.

This is also one of the first papers to explicitly explore the bias that can arise if calls are correlated with location, and to study the relationship between individuals' calling behavior and their location. While being directly applicable for mobile phone data users, the research can also provide insights into the social network behavioral patterns related to people's mobility and social interactions. It also helps to provide insights for other research that uses trace data such as emails, Twitter or LinkedIn to measure population mobility and apply it to different policy questions of interest.

Figure 3.1: Comparison of Distribution of All Day Locations Based on Calls and Pings

Figure 3.2: Annual Comparison of Population Numbers
Notes: This figure compares the number of people in each region in the sample based on their primary location in 2013, calculated by filling in missing days with the location of the closest day versus filling in the missing days with the probability from the exercise conducted using the conditional logit model.

Figure 3.3: Daily Comparison of Population Numbers
(a) Sedhiou (b) Tambacounda
Notes: This figure compares the number of people in two regions in the sample on a daily basis, calculated by filling in missing days with the location of the closest day versus filling in the missing days with the probability from the exercise conducted using the conditional logit model.
Figure 3.4: Monthly Comparison of Malaria Incidence
(a) Saint Louis (b) Dakar
Notes: This figure compares the number of expected malaria cases in two regions in the sample on a monthly basis, calculated by filling in missing days with the location of the closest day versus filling in the missing days with the probability from the exercise conducted using the conditional logit model.

Table 3.1: Summary Statistics for Individuals Receiving and Not Receiving Pings

                          Did Not Receive Ping              Received Ping
                          Mean      Standard Deviation      Mean      Standard Deviation
Number of Calls           871.0     1,689                   3,902     5,428
Number of Days in Data    120.1     109.8                   265.7     104.4
Number of Regions         2.13      1.27                    3.05      1.71
Number of SIM Contacts    108.7     109.5                   361.2     316.3

Notes: Number of regions refers to the number of different regions in which a SIM card has made or received a call/text, based on the mobile phone tower location for each call and text.

Table 3.2: Checking the Random Pings for Independence

                  (1)                 (2)
                  All Days            Exclude Weekends
Log Prob          0.980               0.997
                  (0.00523)           (0.00484)
Observations      1,063,367           931,696
Test Coef=1       0.0001              0.5811

Robust standard errors in parentheses

Notes: I test whether the pings are independent of people's locations by comparing the location from the pings and from all calls for individuals that have data available for every single day of the year, for whom there should be no error in their day to day location. Location is defined as one of the 14 regions of Senegal. When all of the pings are used, there does appear to be error, implying that the pings are not independent of the locations. Once the analysis is done only for weekdays, the error is ameliorated, since the majority seems to come from the fact that pings are less likely on weekends and people's locations are also more likely to change on weekends.
Table 3.3: Results of the Analysis Broken Down by Percent of Time Present in the Data

(a) Days with Calls Only

                  (1)             (2)             (3)             (4)             (5)
                  All             <25% of Days    25-50% of Days  50-75% of Days  >75% of Days
Log Prob          0.989           1.022           1.004           0.998           0.988
                  (0.00212)       (0.0252)        (0.0114)        (0.00657)       (0.00226)
Observations      4,028,773       13,846          73,220          269,673         3,672,034
Test Coef=1       0               0.3757          0.7325          0.7916          0

Robust standard errors in parentheses

(b) Days with Calls and Filled in Missing Days

                  (1)             (2)             (3)             (4)             (5)
                  All             <25% of Days    25-50% of Days  50-75% of Days  >75% of Days
Log Prob          0.987           0.612           0.854           0.972           0.993
                  (0.00217)       (0.0200)        (0.0130)        (0.00797)       (0.00230)
Observations      4,028,773       13,846          73,220          269,673         3,672,034
Test Coef=1       0               0               0               0.0004          0.0028

Robust standard errors in parentheses

Notes: The top panel of the table shows the coefficients from estimating conditional logit models with a dummy for whether a ping occurred in a given region on the left hand side and the probability of being in a region based on all calls on the right hand side. The bottom panel is done for all days, where missing days are assigned the region of the closest day with a call or text. The table shows tests of whether the coefficient in each regression is different from 1. Standard errors are clustered at the individual level.

Table 3.4: Theoretically Fixed Results

                      (1)             (2)             (3)             (4)             (5)
                      All             <25% of Days    25-50% of Days  50-75% of Days  >75% of Days
New Log Probability   0.993           1.000           1.000           0.990           0.994
                      (0.00219)       (0.0327)        (0.0153)        (0.00823)       (0.00231)
Observations          4,028,773       13,846          73,220          269,673         3,672,034
Test Coef=1           0.0028          0.9942          0.9743          0.2118          0.0053

Robust standard errors in parentheses

Notes: The probabilities from all calls are reweighted based on the coefficients from the analysis shown in panel (b) of Table 3.3. The conditional logit specification is now run with the reweighted probabilities on the right-hand side. Standard errors are clustered at the individual level.
Table 3.5: Correcting how Missing Days Are Assigned

(a) Based on the Missing Days Assigned to Achieve Theoretical Probability

                  (1)             (2)             (3)             (4)             (5)
                  All             <25% of Days    25-50% of Days  50-75% of Days  >75% of Days
Log Prob          0.990           0.634           0.899           0.985           0.994
                  (0.00219)       (0.0223)        (0.0147)        (0.00820)       (0.00231)
Observations      4,028,773       13,846          73,220          269,673         3,672,034
Test Coef=1       0               0               0               0.0621          0.0065

(b) Based on the Missing Days Assigned the Lower Probability

                  (1)             (2)             (3)             (4)             (5)
                  All             <25% of Days    25-50% of Days  50-75% of Days  >75% of Days
Log Prob          0.994           0.367           0.825           1.014           1.002
                  (0.00224)       (0.0172)        (0.0134)        (0.00904)       (0.00237)
Observations      4,028,773       13,846          73,220          269,673         3,672,034
Test Coef=1       0.0049          0               0               0.1169          0.4537

(c) Based on the Missing Days Assigned the Theoretical Probability

                  (1)             (2)             (3)             (4)             (5)
                  All             <25% of Days    25-50% of Days  50-75% of Days  >75% of Days
Log Prob          0.992           1.100           1.057           1.016           0.989
                  (0.00215)       (0.0352)        (0.0145)        (0.00724)       (0.00226)
Observations      4,028,773       13,846          73,220          269,673         3,672,034
Test Coef=1       0.0004          0.0043          0.0001          0.024           0

Robust standard errors in parentheses

Notes: In the top panel, the location of missing days with no calls or texts has been filled in based on either the last day with a location before the missing day or the first day with a location after the missing day, in such a way as to achieve a probability in each location as close to the theoretical as possible. In the middle panel, the missing days have been filled in based on either the last day with a location before the missing day or the first day with a location after the missing day, choosing the one that has the lower probability based on just the calls. In the bottom panel, the missing days are assigned the theoretical probability.
Table 3.6: Removing Individuals in the Data with Over 15,000 Calls

                (1)              (2)              (3)              (4)              (5)
                All              <25% of Days     25-50% of Days   50-75% of Days   >75% of Days
Log Prob        0.989            1.021            1.003            0.996            0.988
                (0.00209)        (0.0252)         (0.0114)         (0.00668)        (0.00225)
Observations    3,330,108        13,814           73,035           261,307          2,981,952
Test Coef=1     0                .3958            .7979            .5705            0
Robust standard errors in parentheses

Notes: The main analyses are rerun with any SIMs that have over 15,000 calls or texts in 2013 removed from the sample. Standard errors are clustered at the individual level.

Table 3.7: Checking the Random Pings for Independence Based on New Definition

                (1)              (2)
                Region           Region
                All Days         Excluding Weekends
logprob         0.987            0.991
                (0.00326)        (0.00316)
Observations    2,366,086        2,094,635
Test Coef=1     .0001            .0048
Robust standard errors in parentheses

Notes: A new set of pings is determined based on the stationarity of the location where calls are made from rather than based on the number of contacts. Column 1 shows a test of how well the pings predict location for individuals in the data that have an observation for every day. Column 2 focuses only on weekdays and looks at how well the pings predict location on weekdays. Standard errors are clustered at the individual level.
Table 3.8: Results of the Analysis for Different Definition of Pings

(a) Days with Calls Only

                (1)              (2)              (3)              (4)              (5)
                All              <25% of Days     25-50% of Days   50-75% of Days   >75% of Days
Log Prob        0.998            1.062            1.037            1.003            0.996
                (0.00137)        (0.0164)         (0.00799)        (0.00439)        (0.00147)
Observations    8,533,986        36,518           167,390          574,526          7,755,552
Test Coef=1     .0838            .0002            0                .546             .0058
Robust standard errors in parentheses

(b) Days with Calls and Filled in Missing Days

                (1)              (2)              (3)              (4)              (5)
                All              <25% of Days     25-50% of Days   50-75% of Days   >75% of Days
Log Prob        0.994            0.627            0.883            0.975            1.001
                (0.00144)        (0.0138)         (0.00919)        (0.00551)        (0.00152)
Observations    8,533,986        36,518           167,390          574,526          7,755,552
Test Coef=1     0                0                0                0                .6212
Robust standard errors in parentheses

Notes: The main analysis is redone using the new definition of pings based on the geographical stationarity of the callers, rather than the number of contacts. Standard errors are clustered at the individual level.

3.A Additional Tables and Figures

Figure 3.A.1: Comparison of Distribution of All Day Locations Based on Calls and Pings Excluding Weekends

Table 3.A.1: Results of the Analysis Broken Down by Number of Days Present in the Data, Call and Ping Days Looked at Separately

                (1)              (2)              (3)              (4)              (5)
                All              <25% of Days     25-50% of Days   50-75% of Days   >75% of Days
Log Prob        1.006            0.982            1.014            0.995            1.007
                (0.00319)        (0.0304)         (0.0148)         (0.00900)        (0.00344)
Observations    3,581,073        12,874           68,901           250,352          3,248,946
Test Coef=1     .0479            .324             .3555            .5513            .0369
Robust standard errors in parentheses

Notes: In this analysis, the distribution of locations based on days with no pings is compared to the distribution of locations based on days with pings. Standard errors are clustered at the individual level.