Abstract of "Probabilistic Scene Grammars: A General-Purpose Framework For Scene Understanding" by Jeroen Chua, Ph.D., Brown University, May 2018.

We propose a general-purpose probabilistic framework for scene understanding tasks. We show that several classical scene understanding tasks can be modeled and addressed under a common representation, approximate inference scheme, and learning algorithm. We refer to this approach as the Probabilistic Scene Grammar (PSG) framework. The PSG framework models scenes using probabilistic grammars which capture relationships between objects in terms of compositional rules that provide important contextual cues for inference with ambiguous data. We show how to represent the distribution defined by a probabilistic grammar using a factor graph. We also show how to estimate the parameters of a grammar using an approximate version of Expectation-Maximization, and describe an approximate inference scheme using Loopy Belief Propagation with an efficient message-passing scheme. Inference with Loopy Belief Propagation naturally combines bottom-up and top-down contextual information and leads to a robust algorithm for aggregating evidence. To demonstrate the generality of the approach, we evaluate the PSG framework on the scene understanding tasks of contour detection, face localization, and binary image segmentation. The results of the PSG framework are competitive with algorithms specialized for these scene understanding tasks.

Probabilistic Scene Grammars: A General-Purpose Framework For Scene Understanding
by Jeroen Chua
B.A.Sc., University of Toronto, 2010
M.A.Sc., University of Toronto, 2012

A dissertation submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in the Department of Computer Science at Brown University.

Providence, Rhode Island
May 2018

© Copyright 2018 by Jeroen Chua

This dissertation by Jeroen Chua is accepted in its present form by the Department of Computer Science as satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date    Professor Pedro Felzenszwalb, Director

Recommended to the Graduate Council

Date    Professor Erik Sudderth, Reader
Date    Professor Stuart Geman, Reader

Approved by the Graduate Council

Date    Andrew G. Campbell, Dean of the Graduate School

Acknowledgements

First and foremost, I'd like to thank my advisor Pedro Felzenszwalb for guiding me through graduate school. Throughout my PhD, Pedro has provided insightful comments and crucial connections to past work, and has pointed out where more work was needed to understand the models and approaches we considered. Without his research vision and utmost patience with my at times floundering progress, this thesis would not have been possible. Thank you, Pedro, and I hope the influence of your attention to detail stays with me for the entirety of my career!

I would also like to thank Stuart Geman and Erik Sudderth. I first spoke with Professor Geman at a Snowbird Learning workshop, where he asked me a question about how a model I had presented handled explaining-away. It was an insightful comment, and I had hoped to be able to collaborate with him at Brown University. I was fortunate for this wish to be granted, and our early meetings with Pedro and Jackson Loper were invaluable in establishing the research direction I would take during my PhD. I have also been fortunate enough to interact with Erik Sudderth and his research group.
Erik has provided insightful comments and suggestions during my PhD studies and welcomed me to join his group's research meetings. I am grateful for that, as it gave me the opportunity to learn about the cool things his group was doing and to chat with its members.

Not only have the faculty here at Brown University been a source of support and inspiration, but the graduate students and post-docs have been as well. I'd particularly like to thank John Oberlin, Sobhan Parizi, and Pierre-Yves Laffont for helpful research discussions and overall just being really awesome people! I'd also like to thank Chris Blake for extremely useful tips on writing and being an academic, and for simply being a source of encouragement and inspiration. Be it chats in the robotics lab, chats over a barbell at the gym, or chats over a fiercely contested board game, discussions were always lively and I am thankful to have had the opportunity to collaborate with you all!

I also thank my family for their support and understanding. In particular, they have tolerated my desire to be a student for seemingly as long as possible. Throughout my life, their love and support have enabled me to pursue what truly interests me and have made me the person I am today.

Last and certainly not least, I'd like to thank Sunny, my loving girlfriend, for PhD-levels of support, encouragement, and patience. Besides my advisor, I'm pretty sure Sunny has heard me talk the most about my work and the nitty-gritty details of what that has entailed. This thesis would not have been possible without her support, love, and encouragement.

Contents

1 Introduction
1.1 Design considerations
1.2 Thesis contributions
1.2.1 Representation
1.2.2 Approximate inference
1.2.3 Learning
1.2.4 Experimental evaluation
1.2.5 General implementation
1.3 Related work
1.4 Thesis organization
2 Probabilistic Scene Grammars
3 Example grammars
3.1 Scenes with curves
3.2 Scenes with faces
3.3 Scenes with binary segmentation masks
4 Factor Graph Representation
4.1 Factorization
4.2 Graphical Model
5 Inference Using Loopy Belief Propagation
5.1 Overview of LBP
5.2 Efficient message computation for LBP in the PSG framework
5.2.1 Message passing for Leaky-OR factors
5.2.2 Message passing for Selection factors
5.2.3 Message passing for Berns factors
5.2.4 Proof of Theorem 20
5.3 Markov Chain Monte Carlo as an alternative to LBP
6 Example grammars: Inference with LBP
6.1 LBP computations with a curve grammar
6.2 LBP computations with a face grammar
6.3 LBP computations with a binary segmentation grammar
7 Connections to Pictorial Structures
7.1 Pictorial Structures: Overview
7.2 Expressing a Pictorial Structure model as a PSG
7.2.1 Example construction
7.3 Pictorial Structures vs. PSG: graphical models and inference
8 Learning Model Parameters
8.1 Maximum likelihood estimation
8.2 EM algorithm
8.3 Applying EM to the PSG framework
8.3.1 M-step
8.3.2 Approximate E-step
8.4 Effectiveness of approximate EM learning
9 Experiments
9.1 Contour detection
9.1.1 The PSG contour model
9.1.2 Qualitative contour detection results
9.1.3 Quantitative contour detection results
9.2 Face Localization
9.2.1 Dataset: Labelled Faces in the Wild
9.2.2 Dataset: Family Portraits
9.2.3 Face Detection Grammar
9.2.4 Face data model
9.2.5 Fitting model parameters
9.2.6 Face localization results on single-face images: LFW
9.2.7 Face localization results on multiple-face images: Portraits
9.2.8 Face localization without a Face data model
9.3 Binary image segmentation
9.3.1 The PSG binary image segmentation models
9.3.2 Qualitative binary image segmentation results
9.3.3 Quantitative binary image segmentation results
10 Grammar Transformations
10.1 Counting factor graph edges
10.2 Reducing the number of factor graph edges
10.3 Approximating an N-D distribution by a product of N 1-D distributions
10.3.1 Alternative approximations
10.4 Decomposing a 1D Uniform distribution
10.5 Applications to PSG design
10.5.1 Constructing G′
10.5.2 Examples: transformation of grammars
10.6 Notes on 1-D Uniform distributions with a prime support size
10.7 Notes on general 1-D Categorical distributions
11 Contributions and Suggestions for Future Research
11.1 Summary of approach and research contributions
11.2 Directions for future research
11.2.1 Integration of deep learning models
11.2.2 Applications to more scene understanding tasks
11.2.3 Structure learning
Bibliography

Chapter 1 Introduction

In this thesis we present a general-purpose probabilistic framework for scene understanding tasks. We show that several classical scene understanding problems can be modeled and addressed under a common representation, approximate inference scheme, and learning algorithm. We refer to this general-purpose framework as the Probabilistic Scene Grammar (PSG) framework.

Currently in the field of computer vision, different scene understanding tasks lend themselves to different representations and algorithms. Table 1.1 gives a few examples of scene understanding tasks and approaches to address them. Table 1.1 is not an exhaustive list of tasks, but illustrates the current state of affairs in the computer vision field whereby scene understanding tasks are often tackled by problem-specific approaches.

Table 1.1: Common scene understanding tasks and some approaches to address them. Note the myriad of distinct approaches across and within tasks.

Contour detection: Deep Learning [50, 66]; Canny edge detection [7]; Global Probability-of-Boundary [1]; Field-of-Patterns [16]
Image segmentation: Cut-based approaches [51, 5, 6]; Level sets [60]; Random Forests [49]; Global Probability-of-Boundary [1]; Markov Random Fields and Conditional Random Fields [29, 31, 2]; Mumford and Shah [40]
Image recognition: Convolutional Neural Networks [36, 32]; Bag-of-words / Spatial Pyramid models [35, 23]
2D and 3D object localization: Dalal and Triggs pedestrian detector [9]; DPM [22]; Pictorial Structures [13]; Convolutional Neural Networks [46, 21]; Clouds of Oriented Gradients [47]

The study and realization of a general scene understanding framework has several benefits, outlined below.

• Fundamental improvements to such a general framework yield benefits and potential improvements on many scene understanding tasks simultaneously. In contrast, if an algorithm is specifically designed for a single task, then improving that algorithm only realizes improvements on that one task.

• The space of consumer applications is rapidly expanding and novel scene understanding tasks are being proposed. One such example is the image-to-caption task described in [61], in which the goal is to generate a string of text that describes an image. Although it is possible to design new representations and algorithms to address each novel scene understanding task as it arises, such a strategy is laborious. An alternative to this problem-specific approach is to have a framework that is expressive enough to handle scene understanding tasks in general, concrete enough to be implemented, and fast enough to be practical. The hope is that if novel scene
understanding tasks can be expressed in a form compatible with this general framework, then suitable solutions can be found with minimal research and engineering work.

• Related scene understanding tasks may constrain or provide useful information for other scene understanding tasks. For example, solving the image segmentation problem may help with object recognition, since image segments may correspond to objects and the shape of the segments can be useful information in recognizing objects. In the work of [57], the problems of motion estimation and image segmentation inform one another, since rigid objects tend to have similar motion, and entities with similar motion across a long time-scale may belong to the same object. By iteratively refining the solution to one task by conditioning on the solution of a related task, one may achieve a better overall result than by handling each task in isolation. A general-purpose, unified framework for scene understanding tasks would allow one to naturally model different tasks simultaneously and combine their results in a principled fashion.

• It is a scientifically interesting question to ask whether this myriad of scene understanding tasks, which have historically been addressed with different representations and algorithms in different formalisms, can be understood in a general-purpose probabilistic framework. In particular, one may ask questions such as "How can we represent different scene understanding tasks under a common schema?", "What is a practical, effective problem-agnostic inference scheme?", and "How does one perform problem-agnostic learning and parameter estimation?". Studying such questions may deepen our understanding of the visual world and the nature of scene understanding tasks facing the field of computer vision. In this thesis, we take some steps toward answering such questions.

1.1 Design considerations

To design a general-purpose probabilistic framework for scene understanding tasks, two issues must be carefully considered: the modeling of contextual information and the efficiency of inference.

Consider the image recognition task in Figure 1.1. Even to humans, the image patches shown are ambiguous and it can be difficult to determine what the objects are. The full images from which the image patches were taken are shown in Figure 1.2. After seeing the entire image, recognizing the depicted objects and object parts is straightforward.

Figure 1.1: Each image patch depicts a part of an object. Name the object and part.

There is little agreement in the computer vision literature about what constitutes "context", though it is typically taken to denote "any and all information that may influence the way a scene is perceived" [55]. In the tasks of image recognition and object localization, one notion of context is that objects have part/whole relationships and certain objects often co-occur. In the examples in Figure 1.1, knowing the object from which each patch comes aids in recognizing the depicted part. In the task of contour detection, one may use the idea that contours tend to be long, contiguous curves. In image segmentation, one may use the idea that objects tend to be compact in space, and so image segments should be compact. If one is to build a general-purpose framework for scene understanding, it is crucial for that framework to be able to express a notion of context suitable for a range of scene understanding tasks.
In the PSG framework, we model the broad notion of context in terms of compositional and geometric relationships between objects.

Figure 1.2: The original images from which the image patches in Figure 1.1 were taken. The image patches are indicated by blue boxes. The objects/parts are: bird/beak, bicycle/cogset, chair/armrest.

For practical reasons, we seek to develop efficient inference schemes for tackling scene understanding tasks. Suppose we have an autonomous car that needs to detect other cars to avoid collisions; in practice, such a system has milliseconds to detect the other cars and plan a collision-avoiding route. Consider another example where one has a 3D brain scan of a hospital patient, and one must determine whether the patient has a life-threatening brain hemorrhage and, if so, output a 3D segmentation that localizes the site of the hemorrhage. Here too, time is of the essence. Because some scene understanding tasks may be time-sensitive, we are concerned with developing a general-purpose framework that is not only flexible enough to be applicable to a diverse set of scene understanding tasks, but also admits efficient inference. Unfortunately, exact inference in a general probabilistic model is intractable. In this work, we seek to develop efficient approximate inference schemes.

1.2 Thesis contributions

In this thesis we address four key aspects of defining and assessing a general-purpose probabilistic framework for scene understanding: 1) the representation of scene understanding tasks under a common schema, 2) efficient, problem-agnostic approximate inference, 3) the learning of model parameters under varying levels of supervision, and 4) the experimental evaluation of the framework. A final contribution is the concretization of the framework in a single, general implementation. We refer to the framework developed in this thesis as the Probabilistic Scene Grammar (PSG) framework.

1.2.1 Representation

To represent general scene understanding tasks, we use probabilistic grammars, which have been successful in object modeling for object recognition (see [28, 3, 59, 15, 68, 22, 17, 13, 70]). Probabilistic grammars are defined in terms of a set of symbols, a set of rules that represent relationships between symbols, and a set of rule probabilities that encode how often those relationships occur. The set of symbols represents entities we wish to reason about. For example, the symbols of the grammar might be a face and its parts if we wish to detect faces in scenes, or a set of short curves that compose into long curves if we wish to detect contours. Compositional relationships such as "a face has two eyes, a nose and a mouth, and sometimes a beard", and geometric relationships such as "the mouth is located somewhere below the centre of the face", are encoded as rules and rule probabilities in the grammar. Importantly, probabilistic grammars express a notion of context through compositional and geometric relationships. Such relationships provide contextual cues for inference with ambiguous data. For example, the presence of some parts of a face in a scene provides contextual cues for the presence of a face and its other parts.

To better understand a probabilistic model, it is often helpful to examine samples from the model when possible. To give a sense of the kinds of models that can be represented in the PSG framework, we show samples drawn from example models in Figure 1.3. Chapter 3 describes the exact models used to generate the samples.
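To make this concrete, the sketch below shows one way the face model just described might be written down as plain data: symbols, rules with rule probabilities, per-child pose distributions, and self-rooting probabilities. The encoding, the offsets, and the helper names are illustrative assumptions for this sketch only; the formal definition of a grammar is given in Chapter 2 and the actual face grammar in Chapter 3.

```python
# A hypothetical plain-data encoding of a face grammar (illustrative only;
# the offsets and probabilities below are made up for this sketch).
face_grammar = {
    "symbols": ["FACE", "EYE", "NOSE", "MOUTH"],
    # Each rule: (rule probability, LHS symbol, list of (RHS symbol, pose
    # distribution relative to the parent's pose)).
    "rules": [
        (1.0, "FACE", [("EYE",   ("uniform_rect", (-8, -8), (-2, -4))),
                       ("EYE",   ("uniform_rect", ( 2, -8), ( 8, -4))),
                       ("NOSE",  ("uniform_rect", (-2, -2), ( 2,  2))),
                       ("MOUTH", ("uniform_rect", (-3,  4), ( 3,  8)))]),
        (1.0, "EYE",   []),
        (1.0, "NOSE",  []),
        (1.0, "MOUTH", []),
    ],
    # Self-rooting probabilities: parts may also appear on their own.
    "self_root": {"FACE": 1e-4, "EYE": 1e-5, "NOSE": 1e-5, "MOUTH": 1e-5},
}
```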
1.2.2 Approximate inference

To perform approximate inference, rather than operate directly on probabilistic grammars, we first define a novel transformation from a probabilistic grammar to a graphical model represented by a factor graph. This transformation induces a probability distribution over interpretations of a scene. Unfortunately, exact inference with a general probabilistic model is NP-hard (see [8]), so if we are to have a flexible representation applicable to many scene understanding tasks, it is necessary either to restrict the form of the probabilistic model to make inference tractable or to employ approximate inference techniques. In this thesis, we choose the latter. Fortunately, there has been much work on approximate inference schemes in factor graphs; Loopy Belief Propagation (LBP) [33] is one such approach and has been shown in practice to give good results on a variety of tasks [42, 33, 14]. LBP performs approximate inference by passing "messages" between the nodes of a factor graph until some convergence criterion is met. The messages can then be used to compute marginal probabilities and answer questions such as "what is the probability there is a face at location (x, y) in the scene?". The PSG framework makes use of special cases of LBP whereby messages can be computed efficiently. One of our contributions is the derivation of efficient analytical methods for computing messages for the factor graphs under consideration.

1.2.3 Learning

As with many scene understanding approaches, the PSG framework has model parameters that are ideally learned from data. A rule probability encoding how often a face has a beard is one such model parameter. In general, to learn model parameters, we employ an approximate Expectation-Maximization (EM) algorithm. Here, the exact posterior quantities computed in the Expectation-step are replaced by an approximation to the posterior computed by LBP, and the Maximization-step is standard. The general idea of replacing the exact posteriors computed in the Expectation-step by approximate posteriors computed by LBP was studied in [26]. While [26] was primarily concerned with convergence guarantees, for the models in this thesis the primary issues are speed and performance of the learned models; convergence failure was not a major issue. Further, [26] specifies a Maximization-step that can be intractable to perform for some probabilistic models. In this thesis, we show that for the family of models considered, the Maximization-step can be computed efficiently. Since the Expectation-step is only approximate, this approximate EM algorithm is not guaranteed to have a non-decreasing data-likelihood; however, we have found that it leads to empirically good results.

1.2.4 Experimental evaluation

We evaluate the PSG framework on three scene understanding tasks: contour detection, face localization, and image segmentation. We show that the PSG framework is competitive with algorithms specifically designed for these tasks, despite the generality of the framework.

For the tasks of contour detection and image segmentation, we have a noisy real-valued image D and we seek to recover a binary-valued map B that is the same size as D. We assume that D is obtained by sampling each pixel D(i, j) independently from a Normal distribution whose mean depends on the value of B(i, j), with some known standard deviation σ. Formally, D(i, j) ∼ N(µB(i,j), σ). The goal of inference is to recover B from D.
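As a concrete illustration of this per-pixel observation model, the sketch below samples a noisy image D from a binary map B, and computes the per-pixel log-likelihood ratio that a data term can contribute as evidence. The particular means and standard deviation are illustrative assumptions, not the values used in the experiments.

```python
import numpy as np

def sample_noisy_image(B, mu=(0.0, 1.0), sigma=0.5, rng=None):
    """Sample D(i, j) ~ N(mu[B(i, j)], sigma) independently at each pixel.

    B is a binary array; mu[0] and mu[1] are the means for pixels with
    B = 0 and B = 1 respectively (illustrative values only).
    """
    rng = np.random.default_rng() if rng is None else rng
    means = np.where(B == 1, mu[1], mu[0])
    return rng.normal(loc=means, scale=sigma)

def log_likelihood_ratio(D, mu=(0.0, 1.0), sigma=0.5):
    """Per-pixel evidence log p(D | B=1) - log p(D | B=0); the kind of
    quantity a unary data term in the factor graph can encode."""
    return ((D - mu[0]) ** 2 - (D - mu[1]) ** 2) / (2.0 * sigma ** 2)
```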
For contour detection, we evaluate on the Berkeley Segmentation Dataset (BSD500) described in [1] using a standard train/test split. For image segmentation, we evaluate on a subset of the Swedish Leaf Dataset described in [53].

For the task of face localization, we have images with one or more faces. The task is, given an image, to localize the face(s) and the parts of each face. We evaluate on a subset of the Labelled Faces in the Wild (LFW) dataset introduced in [27], and on our own dataset of family portraits collected from the Internet. We manually annotate each image with bounding box information for all faces and their parts.

Figure 1.4 shows examples of the inputs, desired outputs, and actual outputs from the PSG framework on these scene understanding tasks.

1.2.5 General implementation

In this thesis, experiments involving the PSG framework were performed using a single, general implementation of the PSG framework. The ideas and formalisms outlined in this thesis not only provide a conceptual framework in which one can reason abstractly about different scene understanding tasks, but also allow one to realize a concrete, unified framework to handle diverse scene understanding tasks in practice.

To handle different tasks in this general implementation, one simply expresses a model in a high-level "language" and designs an appropriate data model. The implementation automatically constructs a data structure representing the probabilistic model (or an approximation of it), and performs parameter estimation (learning) and inference with the model. The contribution here is of an engineering nature: the knowledge that it is possible to take the conceptual framework outlined in this thesis and concretize it in a truly general implementation. This approach to a general implementation for generic tasks is similar to the approach of the Probabilistic Programming Language (PPL) community [48, 34, 58], whereby a user specifies an appropriate data model and how to sample from a prior probability distribution, and a potentially suitable inference algorithm is automatically constructed by the PPL framework. We outline the connections to this community in Section 1.3.

1.3 Related work

The desire for a general-purpose computational framework for scene understanding tasks is shared by the PPL community. In particular, the Picture and Edward frameworks described in [34] and [58], respectively, share the high-level goal of having a general-purpose representation and inference engine for scene understanding. However, these works differ from the PSG framework in both the goal and method of inference. Picture and Edward seek to find high-probability scene representations encoded as probabilistic program traces via Markov-Chain Monte-Carlo (MCMC) sampling methods and variational inference schemes. The PSG framework finds marginal distributions over aspects of the scene using LBP. The incorporation of the data model differs substantially as well. For example, in the Picture framework, the data model is combined with a prior over scenes using a computer-graphics renderer. In contrast, the PSG framework incorporates data terms using unary potentials in a factor graph defined in terms of extracted features. Although the incorporation of data terms is simpler in the PSG framework than in many PPL frameworks, the abilities to handle explaining-away and to generate photo-realistic images are sacrificed by the PSG framework.
Lastly, PPL approaches such as Picture and those proposed by Ritchie (see [48]) take the view of performing inference as analysis-by-synthesis, or as a Bayesian inverse-graphics problem. In contrast, the PSG framework frames the problem of inference in a purely analytical approach whereby generating realistic images is not a goal of the framework.

The PSG framework describes scene understanding tasks in a compositional framework. The idea of performing scene understanding in a compositional framework has been a long-standing goal in computer vision. The notion of perceptual organization using grouping and compositional rules goes back at least to the Gestalt theory of perception described by [45] and the "neocognitron" model of [19]. The idea of representing scene understanding tasks in a compositional framework has also been investigated in modern approaches. A major relevant work in this vein is the work of [28], whereby a hierarchical "compositional machine" describes relationships between entities in a Bayesian network. The representation scheme used by the PSG framework is inspired by [28]; however, the PSG framework is subtly different as it uses a factor graph to represent the distribution over scenes. For inference, [28] uses a greedy search heuristic to find plausible interpretations of the scene. In contrast, the PSG framework uses LBP for approximate inference. Also, we study the performance of the PSG framework on a more diverse set of tasks; while [28] studies the task of reading vehicle license plates, we consider the tasks of contour detection, face localization, and image segmentation.

Deformable Part Models (DPM) [12] and Pictorial Structures (PS) [13] are compositional frameworks from which the PSG framework takes much inspiration. DPM and PS represent objects as a collection of parts and connections between parts, and can be understood as a special kind of probabilistic grammar. The form of PS and DPM models allows for efficient exact inference via dynamic programming. Further, PS assumes there is one of each object in the image, while the PSG framework makes no such assumption. The PSG framework considers more general object models and makes fewer assumptions about scenes. The trade-off, however, is that the inference scheme the PSG framework employs is only approximate and, in practice, is slower than the exact inference schemes used by DPM and PS. Nevertheless, the scope of tasks representable in the PSG framework is larger and, as we show in Chapter 9, the PSG framework is capable of outperforming PS on a face localization task.

There has been much work in the area of inference for probabilistic compositional models similar to the PSG framework. The problem of exact inference with general probabilistic models is NP-hard (see [8]). Indeed, efficient inference has been the bane of many probabilistic compositional models. To deal with inference in such models, a variety of ad-hoc or slow sampling schemes have been proposed in the literature (see [34, 58, 28, 59, 68]). For example, the works of [34], [58], [59] and [68] use MCMC techniques for inference, and [28] uses a coarse-to-fine greedy approach to search for potential objects. Ad-hoc heuristics are brittle and are often only applicable to a narrow range of situations. Approaches that rely on MCMC sampling schemes can also be brittle if the MCMC scheme relies on the design of effective proposal distributions. In this work, we use LBP as it is robust in practice and requires relatively little problem-specific engineering.
To our knowledge, this thesis is the first to employ LBP for inference with a probabilistic grammar. The problem of inference for general probabilistic models is a main area of research in the field of machine learning. As such, there are potential alternatives to the LBP approximate inference scheme used in the PSG framework. Variational Inference [63] is a well-studied approximate inference scheme that is fast in practice. However, Variational Inference tends to produce inferior results to LBP in certain situations [65]. MCMC techniques such as Gibbs sampling [20], Metropolis-Hastings [25], Hybrid Monte Carlo [44], and Slice Sampling [43] can also be used to perform inference in the kinds of probabilistic models considered by the PSG framework. Indeed, as stated earlier in this chapter, past approaches have used MCMC techniques. However, MCMC techniques in practice can be slow to converge to the target posterior distribution and may require careful tuning and design, which makes them not ideal for handling general scene understanding tasks where time is of the essence. Other message-passing schemes for inference in loopy graphs exist, such as Tree-reweighted Belief Propagation [30], Generalized Belief Propagation [67], Convergent Belief Propagation [38], and Non-parametric Belief Propagation [56]. It is possible to employ any of these message-passing schemes as the inference engine in the PSG framework. Tree-reweighted Belief Propagation in particular can be useful if LBP has convergence issues, and Generalized Belief Propagation can be useful when one wishes to trade inference speed for increased inference accuracy. In this work, we use LBP as the inference engine since it is a relatively simple message-passing scheme, and we can exploit the particular form of the probabilistic models expressible in the PSG framework to perform efficient message computation.

1.4 Thesis organization

The outline below specifies the organization of this thesis.

• Chapter 2: a formal description of the representation (a probabilistic grammar) used by the PSG framework.
• Chapter 3: some example grammars that can be specified in the PSG framework.
• Chapter 4: description of the transformation of a PSG model into a factor graph.
• Chapter 5: description of the approximate inference scheme used in the PSG framework. In particular, Chapter 5 contains the derivations of the LBP message-passing equations and characterizes the time-complexity of computing messages in the PSG factor graph.
• Chapter 6: examples of running LBP on the example grammars specified in Chapter 3.
• Chapter 7: connections between the PSG framework and the Pictorial Structures model of [13].
• Chapter 8: description of the approximate EM learning algorithm used in the PSG framework.
• Chapter 9: experimental evaluation of the PSG framework on the tasks of contour detection, face localization, and binary image segmentation.
• Chapter 10: description of PSG model transformations that allow for even faster approximate inference.
• Chapter 11: summary of research contributions and suggestions for future research directions that build off the PSG framework.

(a) Samples of contour maps generated by a model of contours. Note that the contours are of varying lengths and shapes and have variable curvature. (b) Samples of faces generated by a model of faces. Note the geometric variability in the locations of the parts of the face and the variable number of faces in each scene.
This face model allows parts of the face to appear on their own. (c) Samples of binary image segmentation maps generated by an image segmentation model. Foreground is shown in black, background is shown in white. The model used here constrains the foreground to be a single connected component but allows for "holes" in the foreground.

Figure 1.3: Samples from models used for contour detection, face localization, and binary image segmentation. All models are expressed in the PSG framework.

(a) The task of contour detection. Left: a noisy input image D. Middle: the desired output, a binary contour map B. Right: visualization of the approximate marginal probabilities p̂(B | D) computed by the PSG framework. Darker pixels indicate a higher approximate marginal probability. (b) The task of face localization. Here, we wish to localize faces, left eyes, right eyes, noses, and mouths. Left: an input image D. Middle: the desired output, a localization of the face and each of its parts in terms of bounding boxes. Right: the top K detections for faces and their parts, where K is the ground-truth number of faces in the image. (c) The task of binary image segmentation. Left: a noisy input image D. Middle: the desired output, a binary segmentation map B. Right: visualization of the approximate marginal probabilities p̂(B | D) computed by the PSG framework. Darker pixels indicate a higher approximate marginal probability.

Figure 1.4: Examples of the scene understanding tasks we use to evaluate the PSG framework.

Chapter 2 Probabilistic Scene Grammars

We take a Bayesian point of view where the goal of a computer vision algorithm is to estimate a description of a scene from a set of observations. A key component of this approach is a prior model over scenes, p(S), that captures the statistical regularities of scenes in the world. A probabilistic scene grammar (PSG) defines a set of possible scenes and a probability distribution over them. Scenes are defined using a library of building blocks, or bricks. Each brick is a pair of a type and a pose. The type is a symbol from a finite alphabet and the pose is an element from a finite pose space. For example, one brick in a scene might be the pair (FACE, (30, 40)) representing a face at location (30, 40) in the image. We capture structural and geometric relationships between bricks using a library of production rules.

To define a distribution over scenes p(S) we consider a process for generating random scenes using a set of production rules. The process starts from an initial set of bricks that are spontaneously generated. Each of the initial bricks is probabilistically expanded to generate new bricks. This process continues until all bricks in the scene have been expanded. The result is a set of bricks organized in a hierarchical fashion. The formal definition of this process is given below. In the next chapter we describe some example grammars and illustrate the random scenes they generate.

In a probabilistic scene grammar the initial generation of bricks in a scene is governed by self-rooting probabilities. The possible expansions of a brick into other bricks are determined by a set of production rules, rule selection probabilities, and conditional pose distributions.

Definition 1 A probabilistic scene grammar (PSG) is defined by a 6-tuple G = (Σ, Ω, R, q, ε, γ).
1. Σ is a finite set (the symbols).
2. Ω = { ΩA | A ∈ Σ } where ΩA is a finite set (the pose spaces).
3. R is a finite set of production rules of the form A0 → A1, ..., An where n ≥ 0 and Ai ∈ Σ.
Let r be a rule in R. We use nr to denote the number of symbols in the right-hand-side (RHS) of r. We use A(r,0) to denote the left-hand-side (LHS) of rule r and A(r,i), 1 ≤ i ≤ nr, to denote the i-th symbol in the RHS of r. We denote the set of rules with symbol A in the LHS by RA.

4. q = { qA | A ∈ Σ } where qA is a distribution over RA (the rule selection probabilities).
5. ε = { εA | A ∈ Σ } is a set of probabilities (the self-rooting probabilities).
6. γ = { γ(ω,r,i) | r ∈ R, 1 ≤ i ≤ nr, ω ∈ ΩA(r,0) } is a set of conditional pose distributions. Each conditional pose distribution γ(ω,r,i) has an associated set of parameters θ(ω,r,i) indexed by ΩA(r,i). We have γ(ω,r,i) : {0, 1}^ΩA(r,i) → R≥0 with

Σ_W γ(ω,r,i)(W | θ(ω,r,i)) = 1   ∀ω ∈ ΩA(r,0),

where the summation is over all possible values of W ∈ {0, 1}^ΩA(r,i). We use θ = {θ(ω,r,i) | r ∈ R, 1 ≤ i ≤ nr, ω ∈ ΩA(r,0)} to denote the set of parameters that govern the conditional pose distributions γ.

Intuitively, the conditional pose distributions γ model geometric and cardinality relationships between bricks. For example, consider a FACE at location (30, 40) in the image. A conditional pose distribution could model that a FACE has exactly one NOSE, and model the distribution over the location of the NOSE of the FACE. As another example, consider an EYE at location (30, 40) in the image. A conditional pose distribution could model how many EYELASHES an EYE has, and the distribution over locations of the EYELASHES of the EYE.

In this thesis, we consider two kinds of conditional pose distributions: the Categorical distribution and the IndBern (short for Independent-Bernoullis) distribution, defined below. Below, let W be a set of binary random variables indexed by Υ. Define the set I(W) = {k | Wk = 1, k ∈ Υ}.

Definition 2 Let Υ be an index set for W and let θ = {θk | k ∈ Υ} be a set of parameters with 0 ≤ θk ≤ 1 and Σ_{k∈Υ} θk = 1. We define

Categorical(W | θ) = ∏_{k∈Υ} θk^Wk if Σ_{k∈Υ} Wk = 1, and 0 otherwise.   (2.1)

Definition 3 Let Υ be an index set for W and let θ = {θk | k ∈ Υ, 0 ≤ θk ≤ 1} be a set of parameters. We define

IndBern(W | θ) = ∏_{k∈Υ} θk^Wk (1 − θk)^(1−Wk).   (2.2)

Note that the IndBern distribution is defined in terms of independent but not identically distributed Bernoulli distributions. Also, in a Categorical distribution, exactly one binary random variable from the set W has value 1, while in an IndBern distribution the binary random variables are independent.

Consider a rule r ∈ R and the i-th symbol in the RHS of r. A Categorical distribution is useful to model a situation in which a brick of type A(r,0) generates exactly one brick of type A(r,i) (e.g., a FACE has one NOSE). An IndBern distribution is useful to model a situation in which there is a set of bricks of type A(r,i) that a brick of type A(r,0) can generate, and elements from the set are selected independently (e.g., an EYE may have any number of EYELASHES above the EYE).

Note that unlike a context-free grammar model used in natural language processing, a scene grammar has no start symbol; instead we have self-rooting probabilities. We also make no distinction between terminal and non-terminal symbols, and allow for rules with an empty right-hand-side. A scene generated by a scene grammar is defined in terms of a finite set of available bricks.

Definition 4 The bricks defined by a grammar G are pairs of symbols and poses, B = { (A, ω) | A ∈ Σ, ω ∈ ΩA }.

Definition 5 A scene S is defined by:
1. A set O ⊆ B of bricks that are present in the scene.
2.
For each brick (A0, ω) ∈ O we have a rule r = A0 → A1, ..., An ∈ RA0 and, ∀ 1 ≤ i ≤ nr, we have a value Wi ∈ {0, 1}^ΩAi such that ∀z ∈ I(Wi), (Ai, z) ∈ O.

We say that a brick (A0, ω) expands to, or is a parent of, the set of bricks {(Ai, z) | 1 ≤ i ≤ nr, z ∈ I(Wi)}.

Let S be the set of scenes defined by a scene grammar G. The set S is the "language" generated by G. To generate a scene we consider a random algorithm that grows a scene starting from an initial set of random bricks. The scene generation process starts from an initial set of bricks that are included in the scene independently at random. We then repeatedly expand bricks in the scene that have not been expanded before. The expansion of a brick generates new bricks that are added to the scene and expanded further. This random algorithm defines a distribution, p(S), that can capture regularities in natural scenes. For example, the process can capture which objects tend to co-occur in a scene and the typical relative positions between different objects.

To formally define the scene generation process we use a set O to keep track of bricks in the scene and a set Q to keep track of bricks that are in the scene but have not been expanded yet. Initial bricks are included in O independently according to self-rooting probabilities. All of these bricks are queued for expansion in Q. If an expansion generates a brick that is not already in O we add the brick to O and queue it for expansion in Q.

Definition 6 A probabilistic scene grammar G defines a random algorithm for generating scenes:
1. Initially O = ∅ and Q = ∅.
2. For each brick (A, ω) ∈ B we add (A, ω) to O and Q with probability εA.
3. While Q ≠ ∅ we remove a brick (A, ω) from Q and expand it.
4. Expanding (A, ω) involves (a) sampling a rule r = A0 → A1, ..., An ∈ RA according to qA, and (b) for 1 ≤ i ≤ nr, sampling a set Wi of binary values according to γ(ω,r,i)(Wi | θ(ω,r,i)), and for each z ∈ I(Wi), if (Ai, z) ∉ O, adding it to both O and Q.

The scene S is defined by O and the choices made when expanding each brick in O. The output of this algorithm defines a distribution p(S) over scenes in S. We note that the scene generation algorithm terminates after a finite number of expansions bounded by the total number of bricks in B. As discussed above, the queue Q keeps track of bricks that are in the scene but have not been expanded yet. When Q is empty, every brick in O has been expanded exactly once. Therefore when the process terminates we have a scene S ∈ S. We also note that the order in which the bricks from Q are selected for expansion does not affect the probability of generating a particular scene. Therefore the arbitrary choice of expansion order does not change the distribution over scenes defined by the algorithm.

Remark 7 Scene grammars are related to context-free grammars used in language modeling. We note however that they generate different types of structures. Recall that a context-free grammar generates rooted derivation trees, where the vertices are labeled with symbols from a finite alphabet. In a derivation tree there is a single vertex (the root) with no parents and every other vertex has a unique parent.
In particular we can have multiple vertices with no parents (roots) in G, and the graph can have multiple disjoint components. We can also have vertices with multiple parents in G. Therefore multiple roots can lead to the same vertex and there can be multiple paths from one vertex to another. The scene graph will also have directed cycles when a brick (A, ω) in the scene leads to a sequence of expansions that eventually generate (A, ω) again. Finally we note that every scene graph is a subgraph of the complete directed graph over B, and the number of possible scene graphs is finite (although it can be very large). This is in contrast to the fact that a context-free grammar can generate trees of unbounded size. Chapter 3 Example grammars In this chapter we give some examples of PSGs and illustrate the random scenes they generate. Recall that a PSG G is defined by a 6-tuple (Σ, Ω, R, q, , γ) . In the examples below we combine the description of R, q, and γ to simplify the notation. Let r = A0 → A1 , . . . , An be a rule in R. To specify the rule r, the rule selection probability qr , and the conditional pose distributions associated with r, we write, qr , (A0 , ω0 ) → (A1 , γ(ω0 ,r,1) (·|θ(ω0 ,r,1) )), . . . , (An , γ(ω0 ,r,n) (·|θ(ω0 ,r,n) )). (3.1) In the examples in this chapter, the pose spaces are grids of integer points [N1 ] × · · · × [ND ] where [N ] = {0, . . . , N − 1}. Denote such a pose space by Υ. Below, we use Rect(a, b) to indicate the set of grid points in the hyperrectangle with diagonal (a, b). We define two special kinds of Categorical distributions; the UniformRect distribution and a distribution concentrated at a single point in the grid. We also define a special kind of IndBern distribution: a UniformBern (short for Uniform-Independent-Bernoullis) distribution. Definition 8 Let W be a set of binary random variables indexed by Υ. Let a and b be two elements of Υ. We define the UniformRect distribution as UniformRect(W ; a, b) =  1   | Rect(a,b)| ,  0, P Wk = 1, I(W ) ⊆ Rect(a, b) k∈Υ otherwise where | Rect(a, b)| denotes the size of the set Rect(a, b). We denote a distribution concentrated at a single point by δ(W ; a) = UniformRect(W ; a, a). 19 (3.2) 20 Definition 9 Let W be a set of binary random variables indexed by Υ and let T ⊆ Υ be a set. We define the UniformBern distribution as UniformBern(W ; T, θ) = Y θkWk × (1 − θk )1−Wk (3.3) k∈Υ θk  θ, k ∈ T = 0, otherwise. (3.4) For brevity, for the rest of this thesis we drop the argument W from the distributions above, and will denote them as UniformRect(a, b), δ(a), and UniformBern(T, θ). 3.1 Scenes with curves Grammar 1 generates scenes with discrete curves. Figure 3.1 shows some images generated by this model. The grammar generates scenes with a random number of curves and where each curve has a random length and shape, giving preference to curves with low-curvature. The approach is related to the Elastica model in [41] where the tangent function of a random curve is defined by a random walk. In Chapter 6 we show how this model can be used for contour completion and in Chapter 9 we show how the model can be used to detect curves in noisy images. A curve is represented by a sequence of oriented elements. Curves are extended one element at a time, moving from one pixel in the image to a neighboring pixel in a direction close to the current orientation. At each step a curve can also end or change orientation with small probability. As a curve is generated the process leaves a trace of ink in the image. 
The grammar has two symbols, Σ = {CURVE, INK}. The CURVE bricks represent oriented elements that are connected sequentially to form curves. The pose of a CURVE brick specifies a pixel location and one of 8 possible orientations. The INK bricks represent the pixels that are covered by a curve and capture what we see in an image. The pose of an INK brick specifies only a pixel location and has no orientation information.

Grammar 1 A grammar for 2D images with curves. The function Tθ denotes a rotation in the plane by an angle θ and Round maps a point in the plane to the nearest grid point.

Σ = {CURVE, INK}. ΩCURVE = [N] × [M] × [8]. ΩINK = [N] × [M].
Rules:
0.65, (CURVE, (x, y, θ)) → (INK, δ((x, y))), (CURVE, δ(((x, y) + Round(Tθ(1, 0)), θ)))
0.10, (CURVE, (x, y, θ)) → (INK, δ((x, y))), (CURVE, δ(((x, y) + Round(Tθ(1, −1)), θ)))
0.10, (CURVE, (x, y, θ)) → (INK, δ((x, y))), (CURVE, δ(((x, y) + Round(Tθ(1, +1)), θ)))
0.05, (CURVE, (x, y, θ)) → (CURVE, δ((x, y, θ + 1)))
0.05, (CURVE, (x, y, θ)) → (CURVE, δ((x, y, θ − 1)))
0.05, (CURVE, (x, y, θ)) → (INK, δ((x, y)))
1.00, (INK, (x, y)) → ∅
εCURVE = εINK = 10⁻⁴.

Figure 3.1: Random images generated using Grammar 1. The black pixels represent the INK bricks that are present in a random scene. The grammar generates discrete curves of varying lengths and shapes, giving a preference to curves with low curvature.

The first three rules that can be used to expand a CURVE brick capture the possible extensions of a curve along a direction that is close to the current orientation. When we extend a curve at pixel (x, y) with orientation θ, we move to one of 3 neighbors of (x, y) that are approximately in the direction θ. Figure 3.2 illustrates the possible extensions for a horizontal element. The last three rules that can be used to expand a CURVE brick capture changes in orientation and the ending of a curve. The probability of changing the current orientation is small, so curves tend to take multiple steps along a single discrete orientation before turning. As we generate a curve we also generate INK bricks tracing the path of the curve.

Figure 3.2: A depiction of the possible extensions of a curve by one pixel. In this case the horizontal CURVE brick indicated by the red pixel expands to a CURVE brick of the same orientation in one of the blue pixels with the indicated probabilities.
The remaining probability mass is reserved for the choice to end the curve or change its orientation.

3.2 Scenes with faces

Grammar 2 generates scenes with faces and parts of faces. The model captures the notion that each scene has a variable number of objects, and that faces have certain parts at appropriate locations. We also allow parts of faces to appear on their own, capturing the notion that a scene is made up of a set of faces and other components that look like parts of faces. Figure 3.3 shows some examples of random scenes generated by this grammar. In this grammar the pose of a brick specifies a 2D location for an object of fixed size and orientation. The parameters N and M denote the number of pixels in each dimension of a 2D image. The compositional model for a face is captured by the rule FACE → EYE, EYE, NOSE, MOUTH. When expanding a FACE brick, the possible locations for the parts are defined relative to the face location. In the grammar considered here, the location of each part is selected uniformly at random from a rectangular region defined relative to the location of the face. Figure 3.4 shows the possible locations for the parts when a face is at the origin. This part-based representation for a face captures pairwise relationships between locations of different parts and is similar to a Pictorial Structures model ([18, 13]).

Figure 3.3: Random scenes with faces and parts of faces generated using Grammar 2. Faces are represented by red rectangles, eyes by blue circles, noses by green triangles, and mouths by magenta rectangles. Scenes have multiple objects and parts of faces can appear both in the context of a face and on their own. The location of a part, such as the nose, can vary within a range of possible locations relative to the face.

Grammar 2 A grammar for 2D scenes with faces and parts of faces:
Σ = {FACE, EYE, NOSE, MOUTH}. ∀A ∈ Σ, ΩA = [N] × [M].
Rules:
1.0, (FACE, ω) → (EYE, UniformRect(ω + a1, ω + b1)), (EYE, UniformRect(ω + a2, ω + b2)), (NOSE, UniformRect(ω + a3, ω + b3)), (MOUTH, UniformRect(ω + a4, ω + b4))
1.0, (EYE, ω) → ∅
1.0, (NOSE, ω) → ∅
1.0, (MOUTH, ω) → ∅
εFACE = 10⁻⁴, εEYE = εNOSE = εMOUTH = 10⁻⁵.

Figure 3.4: A depiction of the possible locations of the face parts when the FACE is located at the pixel indicated by the red circle. The blue, green, and magenta pixels indicate the possible locations for the EYE, NOSE, and MOUTH symbols, respectively.

We note that Grammar 2 can be extended to represent objects of different sizes and orientations by augmenting the pose spaces with scale and orientation information. In Chapter 9, we show how a similar model can be used for face detection. The grammar defined above can also be extended to define scenes with multiple objects from different categories where each object is defined in terms of a set of common parts.

3.3 Scenes with binary segmentation masks

Grammar 3 generates a binary segmentation mask. The foreground generated by this grammar is a single, non-empty connected component of pixels, where connections are considered in an 8-neighbourhood around a grid point. Figure 3.5 shows some images generated by this grammar. As shown in the figure, the foreground generated may have "holes" in it.

Grammar 3 A grammar for 2D foreground/background image segmentation for an N × M scene:
Σ = {SEED, FG}. ΩSEED = [1]. ΩFG = [N] × [M].
Rules:
1.0, (SEED, ω) → (FG, UniformRect((1, 1), (N, M)))
1.0, (FG, ω) → (FG, UniformBern(Rect(ω − (1, 1), ω + (1, 1)) \ ω, 0.25))
εSEED = 1, εFG = 0.

Note that Rect(ω − (1, 1), ω + (1, 1)) \ ω is the set of points in the rectangle with diagonal (ω − (1, 1), ω + (1, 1)) excluding the centre of the rectangle; that is, the 8 neighbours of pixel ω.

Grammar 3 can be thought of as assigning a label (foreground or background) to each grid point. The approach is related to an Ising model on a grid, but here we consider the 8-neighbourhood around a grid point rather than the 4-neighbourhood. As in the Ising model, a point and its neighbours are encouraged to have the same label. Unlike the Ising model, however, the assignment of labels to grid points can be formulated as a generative process that is guaranteed to produce a single, non-empty, connected foreground component. Further, the set of labelings that can be produced by the grammar is exactly the set of labelings with a single, non-empty connected foreground component.

The grammar has two symbols, Σ = {SEED, FG}. Intuitively, the SEED symbol selects a location in the image from which to start growing the foreground. This guarantees that there is at least one grid point labelled foreground in the image.
Each grid point labelled foreground selects a subset of its 8 neighbours to be foreground as well; each of its neighbours is considered independently and selected with probability 0.25. Figure 3.6 illustrates, for a given FG brick, the set of other FG bricks it can generate and with what probability it does so. Since εFG = 0, and the generative process of expanding a brick (FG, ω) considers selecting other FG bricks in an 8-neighbourhood around ω, the generative process is guaranteed to create a single connected component. Note that the expected number of bricks a brick (FG, ω) expands to is 2, and so one may be concerned that the generative process will never terminate. Recall, however, that in the generative process described in Definition 6 a brick can only be expanded once, so the process terminates with probability 1: we expand at most N × M FG bricks and 1 SEED brick.

Figure 3.5: Random scenes generated using Grammar 3. The black pixels represent the FG bricks that are present in a random scene. The model generates a single, non-empty, connected (in the 8-neighbourhood sense) foreground segment.

Figure 3.6: A depiction of the possible generations of a FG brick. Indicated in red is the location of the FG brick being expanded. The possible FG bricks that may be generated by this brick are indicated in blue, and the probability of generation is shown. The potentially generated bricks are considered independently.

Chapter 4 Factor Graph Representation

A scene grammar defines a probability distribution, p(S), over scenes. Here we describe a factorization of p(S) and a representation of this distribution by a factor graph with a finite number of binary random variables. In practice the factor graph representation can be used as a data structure for inference. In particular we can use this representation for computing posterior marginals with Loopy Belief Propagation (Chapter 5). The factor graph formulation can also be used for learning model parameters with an approximate EM algorithm (Chapter 8). We start by considering a representation of scenes using a finite set of binary random variables.

Definition 10 For a brick (A, ω) ∈ B, a rule r ∈ RA, and 1 ≤ i ≤ nr, define Γ(ω,r,i) = {ω′ | θ(ω,r,i,ω′) > 0}. Note that Γ(ω,r,i) ⊆ ΩA(r,i) since ΩA(r,i) indexes θ(ω,r,i). The set {(A(r,i), z) | z ∈ Γ(ω,r,i), 1 ≤ i ≤ nr} is the set of bricks that brick (A(r,0), ω) can generate when rule r is chosen.

Definition 11 A scene S generated by a grammar G defines a collection of binary random variables associated with each brick (A, ω) ∈ B,

X(A, ω) ∈ {0, 1},   (4.1)
R(A, ω) = { R(A, ω, r) ∈ {0, 1} | r ∈ RA },   (4.2)
C(A, ω) = { C(A, ω, r, i, ω′) ∈ {0, 1} | r ∈ RA, 1 ≤ i ≤ nr, ω′ ∈ Γ(ω,r,i) },   (4.3)

where X(A, ω) = 1 if (A, ω) is in the scene,
On the other hand, the grammar for scenes with curves in Section 3.1 is cyclic, because a sequence of expansions starting from a CURVE brick can generate the initial brick again.

A topological ordering of $\mathcal{B}$ is a linear ordering of $\mathcal{B}$ such that $(A, \omega)$ appears before $(B, z)$ whenever $(A, \omega)$ can generate $(B, z)$ after one or more expansions. We note that when $G$ is acyclic there is always a topological ordering of $\mathcal{B}$, and such an ordering can be computed by topologically sorting the vertices of $H$.

4.1 Factorization

Let $p(X, R, C)$ denote the distribution defined by the scene generation algorithm. For an acyclic grammar the distribution $p(X, R, C)$ can be factored into a product of local potential functions. The factorization gives a simple closed-form expression for $p(X, R, C)$ and leads to a factor graph representation that can be used for inference with a scene grammar. The factorization described here is analogous to the expression of the joint distribution in a Bayesian network. We note that the factorization is exact only for acyclic grammars, but it can also be used in practice as an approximation for inference with cyclic grammars.

There are three types of factors in the factorization of $p(X, R, C)$. Below, for a set of binary values $W$, let $c(W)$ be the number of ones in $W$. The three types of factors are illustrated in Figure 4.1 and defined below.

Figure 4.1: The three types of factors in the factorization of $p(X, R, C)$. (a) Leaky-OR factor. (b) Selection factor. (c) Berns factor.

Definition 12 A Leaky-OR potential $\Psi^L_\epsilon(Y, z)$ is a function of a set of binary inputs $Y = \{y_1, \dots, y_n\}$ and a binary output $z$. It represents the conditional probability of each possible output in a probabilistic OR gate. If $c(Y) > 0$ we have $z = 1$ with probability 1. If $c(Y) = 0$ we have $z = 1$ with probability $\epsilon$.

$$\Psi^L_\epsilon(Y, z) = \begin{cases} 1 & z = 1,\ c(Y) > 0, \\ 0 & z = 0,\ c(Y) > 0, \\ \epsilon & z = 1,\ c(Y) = 0, \\ 1 - \epsilon & z = 0,\ c(Y) = 0. \end{cases}$$

Definition 13 A Selection potential $\Psi^S_\theta(y, Z)$ is a function of a binary input $y$ and a set of binary outputs $Z = \{z_1, \dots, z_n\}$. This factor models the selection of a random output. If $y = 0$, then the output $Z$ such that $c(Z) = 0$ is selected with probability 1. If $y = 1$, then exactly one of the $z_i$ has value 1. The choice of which $z_i$ to set to 1 (select) is governed by the probabilities defined by $\theta$.

$$\Psi^S_\theta(y, Z) = \begin{cases} 1 & y = 0,\ c(Z) = 0, \\ 0 & y = 0,\ c(Z) > 0, \\ \text{Categorical}(Z \mid \theta) & y = 1. \end{cases}$$

Definition 14 A Berns potential $\Psi^B_\theta(y, Z)$ is a function of a binary input $y$ and a set of binary outputs $Z = \{z_1, \dots, z_n\}$. This factor models the selection of multiple outputs conditional on $y$. If $y = 0$, then the output $Z$ such that $c(Z) = 0$ is selected with probability 1. If $y = 1$, then $z_i = 1$ with probability $\theta_{z_i}$.

$$\Psi^B_\theta(y, Z) = \begin{cases} 1 & y = 0,\ c(Z) = 0, \\ 0 & y = 0,\ c(Z) > 0, \\ \text{IndBern}(Z \mid \theta) & y = 1. \end{cases}$$

Our main observation is that $p(X, R, C)$ can be expressed in closed form in terms of a product of potentials of the types defined above. To formulate the factorization we consider the following collections of random variables,

$$C(A, \omega, r, i) = \{\, C(A, \omega, r, i, \omega') \mid \omega' \in \Gamma_{(\omega,r,i)} \,\},$$
$$\mathrm{par}(X(A, \omega)) = \{\, C(B, \omega', r, i, \omega) \mid B \in \Sigma,\ \omega' \in \Omega_B,\ r \in R_B,\ 1 \le i \le n_r,\ A_{(r,i)} = A,\ \omega \in \Gamma_{(\omega',r,i)} \,\}.$$

The set $C(A, \omega, r, i)$ includes all the poses that can be associated with the $i$-th child of brick $(A, \omega)$ if rule $r$ is used to expand $(A, \omega)$. The set $\mathrm{par}(X(A, \omega))$ includes all the random variables that can indicate a parent of $X(A, \omega)$ in the scene.
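To make Definitions 12-14 concrete, the following minimal sketch evaluates the three potentials on explicit binary configurations, with $Y$ and $Z$ given as 0/1 lists; the function names are illustrative only.

```python
def leaky_or(Y, z, eps):
    """Psi^L_eps (Definition 12): a probabilistic OR gate with leak eps."""
    if sum(Y) > 0:
        return 1.0 if z == 1 else 0.0
    return eps if z == 1 else 1.0 - eps

def selection(y, Z, theta):
    """Psi^S_theta (Definition 13): if y = 1, exactly one output is selected,
    with output i chosen with probability theta[i]."""
    if y == 0:
        return 1.0 if sum(Z) == 0 else 0.0
    return theta[Z.index(1)] if sum(Z) == 1 else 0.0

def berns(y, Z, theta):
    """Psi^B_theta (Definition 14): if y = 1, each output z_i is an
    independent Bernoulli(theta[i])."""
    if y == 0:
        return 1.0 if sum(Z) == 0 else 0.0
    p = 1.0
    for zi, ti in zip(Z, theta):
        p *= ti if zi else 1.0 - ti
    return p
```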
Proposition 15 The distribution $p(X, R, C)$ defined by an acyclic grammar $G$ can be expressed as

$$p(X, R, C) = \prod_{(A,\omega)\in\mathcal{B}} \Big[\, p(X(A,\omega) \mid \mathrm{par}(X(A,\omega)))\; p(R(A,\omega) \mid X(A,\omega)) \prod_{\substack{r\in R_A \\ 1\le i\le n_r}} p(C(A,\omega,r,i) \mid R(A,\omega,r)) \Big]. \tag{4.4}$$

Moreover, if $G$ is acyclic we have

$$p(X(A,\omega) = z \mid \mathrm{par}(X(A,\omega)) = Y) = \Psi^L_{\epsilon_A}(Y, z) \tag{4.5}$$
$$p(R(A,\omega) = Z \mid X(A,\omega) = y) = \Psi^S_{q_A}(y, Z) \tag{4.6}$$
$$p(C(A,\omega,r,i) = Z \mid R(A,\omega,r) = y) = \Psi^S_{\theta_{(\omega,r,i)}}(y, Z) \text{ or } \Psi^B_{\theta_{(\omega,r,i)}}(y, Z). \tag{4.7}$$

Proof Let $V_i = \{X_i, R_i, C_i\}$ denote the random variables associated with the $i$-th brick in a topological ordering of $\mathcal{B}$. We can write $p(X, R, C) = \prod_i p(V_i \mid V_1, \dots, V_{i-1})$.

5.2.1 Message passing for Leaky-OR factors

Below, let $\Psi^L_\epsilon$ be a Leaky-OR potential and let $f$ be the corresponding Leaky-OR factor node. As illustrated in Figure 4.1(a), the factor has neighbouring nodes $N(f) = Y \cup \{z\}$. We assume that $\mu_{u\to f}(x_u) > 0$ and $\sum_{x_u} \mu_{u\to f}(x_u) = 1$, $\forall u \in N(f)$ and $x_u \in \{0, 1\}$.

Theorem 22 All messages from a Leaky-OR factor node to all of its neighbouring variable nodes can be computed in time linear in the degree of the factor node.

To prove Theorem 22, we require two lemmas.

Lemma 23 The messages $\mu_{f\to z}(x_z)$ can be expressed as
$$\mu_{f\to z}(0) = (1 - \epsilon) \prod_{y\in Y} \mu_{y\to f}(0) \tag{5.5}$$
$$\mu_{f\to z}(1) = 1 - \mu_{f\to z}(0). \tag{5.6}$$

Lemma 24 The messages $\mu_{f\to y}(x_y)$ $\forall y \in Y$ can be expressed as
$$\mu_{f\to y}(0) \propto \mu_{z\to f}(1) + \Big(\prod_{u\in Y\setminus y} \mu_{u\to f}(0)\Big)(1 - \epsilon)\big(\mu_{z\to f}(0) - \mu_{z\to f}(1)\big) \tag{5.7}$$
$$\mu_{f\to y}(1) \propto \mu_{z\to f}(1) \tag{5.8}$$
where the constants of proportionality are chosen so that $\mu_{f\to y}(0) + \mu_{f\to y}(1) = 1$.

Proof of Lemma 23 Substituting the form of the Leaky-OR potential defined in Definition 12 into the general message passing equation given in Eqn. 5.2, and noting that the quantity of interest is $\mu_{f\to z}(x_z)$, yields:
$$\mu_{f\to z}(x_z) \propto \sum_{x_Y} \Psi^L_\epsilon(x_Y, x_z) \prod_{y\in Y} \mu_{y\to f}(x_y) \tag{5.9}$$
where the summation is over all possible configurations of $x_Y$. Consider the case $x_z = 0$: the potential is non-zero only when $c(x_Y) = 0$, so
$$\mu_{f\to z}(0) \propto \sum_{x_Y : c(x_Y) = 0} (1 - \epsilon) \prod_{y\in Y} \mu_{y\to f}(x_y) = (1 - \epsilon) \prod_{y\in Y} \mu_{y\to f}(0). \tag{5.12}$$
Note that $\Psi^L_\epsilon(x_Y, 1) = 1 - \Psi^L_\epsilon(x_Y, 0)$, and so, following the derivation above and using the normalization of the incoming messages,
$$\mu_{f\to z}(1) \propto 1 - (1 - \epsilon) \prod_{y\in Y} \mu_{y\to f}(0). \tag{5.13}$$
The constants of proportionality in Eqns. 5.12 and 5.13 can be set to 1 to ensure $\sum_{x_z} \mu_{f\to z}(x_z) = 1$.

Proof of Lemma 24 Substituting the form of the Leaky-OR potential into Eqn. 5.2, and noting that the quantity of interest is $\mu_{f\to y}(x_y)$, $y \in Y$, yields:
$$\mu_{f\to y}(x_y) \propto \sum_{x_{Y\setminus y}} \sum_{x_z} \Psi^L_\epsilon(x_Y, x_z)\, \mu_{z\to f}(x_z) \prod_{u\in Y\setminus y} \mu_{u\to f}(x_u) \tag{5.14}$$
where $\sum_{x_{Y\setminus y}}$ is a summation over all configurations of $x_{Y\setminus y}$. Consider the case $x_y = 0$, and split the sum according to whether $c(x_{Y\setminus y}) = 0$. When $c(x_{Y\setminus y}) = 0$ we have $c(x_Y) = 0$, and the corresponding terms contribute $\big(\epsilon\,\mu_{z\to f}(1) + (1 - \epsilon)\,\mu_{z\to f}(0)\big) \prod_{u\in Y\setminus y} \mu_{u\to f}(0)$. When $c(x_{Y\setminus y}) > 0$ we have $c(x_Y) > 0$, so only $x_z = 1$ contributes, and by the normalization of the incoming messages these terms contribute $\mu_{z\to f}(1)\big(1 - \prod_{u\in Y\setminus y} \mu_{u\to f}(0)\big)$. Combining the two contributions,
$$\mu_{f\to y}(0) \propto \mu_{z\to f}(1) + \Big(\prod_{u\in Y\setminus y} \mu_{u\to f}(0)\Big)(1 - \epsilon)\big(\mu_{z\to f}(0) - \mu_{z\to f}(1)\big). \tag{5.19}$$
Consider the case $x_y = 1$: then $c(x_Y) > 0$ for every configuration, so only $x_z = 1$ contributes and
$$\mu_{f\to y}(1) \propto \mu_{z\to f}(1) \sum_{x_{Y\setminus y}} \prod_{u\in Y\setminus y} \mu_{u\to f}(x_u) = \mu_{z\to f}(1). \tag{5.22}$$

Proof of Theorem 22 From Lemma 23, it is clear that the messages $\mu_{f\to z}(x_z)$, $x_z \in \{0, 1\}$, can be computed in time $O(|Y|)$. From Lemma 24, the computation of $\mu_{f\to y}(1)$ $\forall y \in Y$ is trivial. The quantities $\mu_{f\to y}(0)$ $\forall y \in Y$ can be computed jointly in $O(|Y|)$ time by applying Observation 1. Therefore, all messages from a Leaky-OR factor node to its neighbouring variable nodes can be computed in $O(|Y|)$ time. Noting that the degree of the Leaky-OR factor node is $|Y| + 1$ completes the proof.
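The updates of Lemmas 23 and 24 translate directly into code. The sketch below computes every outgoing message of one Leaky-OR factor in $O(|Y|)$ time; it assumes Observation 1 is the standard prefix/suffix-product trick for obtaining all leave-one-out products in linear time, and the array names are illustrative.

```python
import numpy as np

def leaky_or_messages(mu_in_y, mu_z, eps):
    """All outgoing messages of a Leaky-OR factor in O(|Y|) time.
    mu_in_y: array of shape (n, 2), row k is (mu_{y_k->f}(0), mu_{y_k->f}(1));
    mu_z: (mu_{z->f}(0), mu_{z->f}(1)). Each returned message sums to 1."""
    y0 = mu_in_y[:, 0]
    # Message to the output z (Lemma 23).
    to_z0 = (1.0 - eps) * np.prod(y0)
    mu_f_to_z = np.array([to_z0, 1.0 - to_z0])
    # Leave-one-out products prod_{u != y} mu_{u->f}(0) via prefix/suffix
    # products (the linear-time trick referred to as Observation 1).
    pre = np.concatenate(([1.0], np.cumprod(y0)[:-1]))
    suf = np.concatenate((np.cumprod(y0[::-1])[-2::-1], [1.0]))
    loo = pre * suf
    # Messages to each input y (Lemma 24).
    to_y0 = mu_z[1] + loo * (1.0 - eps) * (mu_z[0] - mu_z[1])
    to_y1 = np.full_like(to_y0, mu_z[1])
    norm = to_y0 + to_y1
    return mu_f_to_z, np.stack([to_y0 / norm, to_y1 / norm], axis=1)
```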
5.2.2 Message passing for Selection factors

Recall the definition of the Selection potential in Definition 13, and the general message passing equation from factor nodes to variable nodes given in Eqn. 5.2. We will show that exploiting the structure of the Selection potential allows one to compute all messages from a Selection factor node to its neighbouring variable nodes in time linear in the degree of the factor node. Below, let $\Psi^S_\theta$ be a Selection potential and let $f$ be the corresponding Selection factor node. As illustrated in Figure 4.1(b), the factor has neighbouring nodes $N(f) = Z \cup \{y\}$. We assume that $\mu_{u\to f}(x_u) > 0$ and $\sum_{x_u} \mu_{u\to f}(x_u) = 1$, $\forall u \in N(f)$ and $x_u \in \{0, 1\}$.

Theorem 25 All messages from a Selection factor node to all of its neighbouring variable nodes can be computed in time linear in the degree of the factor node.

To prove Theorem 25, we require two lemmas.

Lemma 26 The messages $\mu_{f\to y}(x_y)$ can be expressed as
$$\mu_{f\to y}(0) \propto \prod_{z\in Z} \mu_{z\to f}(0) \tag{5.23}$$
$$\mu_{f\to y}(1) \propto \Big(\prod_{z\in Z} \mu_{z\to f}(0)\Big) \sum_{z\in Z} \theta_z \frac{\mu_{z\to f}(1)}{\mu_{z\to f}(0)} \tag{5.24}$$
where the constants of proportionality are chosen so that $\mu_{f\to y}(0) + \mu_{f\to y}(1) = 1$.

Lemma 27 The messages $\mu_{f\to z}(x_z)$ $\forall z \in Z$ can be expressed as
$$\mu_{f\to z}(0) \propto \Big(\prod_{u\in Z\setminus z} \mu_{u\to f}(0)\Big)\Big(\mu_{y\to f}(0) + \mu_{y\to f}(1) \sum_{v\in Z\setminus z} \theta_v \frac{\mu_{v\to f}(1)}{\mu_{v\to f}(0)}\Big) \tag{5.25}$$
$$\mu_{f\to z}(1) \propto \theta_z\, \mu_{y\to f}(1) \prod_{u\in Z\setminus z} \mu_{u\to f}(0) \tag{5.26}$$
where the constants of proportionality are chosen so that $\mu_{f\to z}(0) + \mu_{f\to z}(1) = 1$.

Proof of Lemma 26 Substituting the form of the Selection potential defined in Definition 13 into the general message passing equation given in Eqn. 5.2, and noting that the quantity of interest is $\mu_{f\to y}(x_y)$, yields:
$$\mu_{f\to y}(x_y) \propto \sum_{x_Z} \Psi^S_\theta(x_y, x_Z) \prod_{z\in Z} \mu_{z\to f}(x_z) \tag{5.27}$$
where the summation is over all possible configurations of $x_Z$. Consider the case $x_y = 0$: only the configuration with $c(x_Z) = 0$ has non-zero potential, so
$$\mu_{f\to y}(0) \propto \prod_{z\in Z} \mu_{z\to f}(0). \tag{5.29}$$
Consider the case $x_y = 1$: only configurations with $c(x_Z) = 1$ have non-zero potential, so
$$\mu_{f\to y}(1) \propto \sum_{z\in Z} \theta_z\, \mu_{z\to f}(1) \prod_{u\in Z\setminus z} \mu_{u\to f}(0) = \Big(\prod_{z\in Z} \mu_{z\to f}(0)\Big) \sum_{z\in Z} \theta_z \frac{\mu_{z\to f}(1)}{\mu_{z\to f}(0)}. \tag{5.32}$$

Proof of Lemma 27 Substituting the form of the Selection potential into Eqn. 5.2, and noting that the quantity of interest is $\mu_{f\to z}(x_z)$, $z \in Z$, yields:
$$\mu_{f\to z}(x_z) \propto \sum_{x_{Z\setminus z}} \sum_{x_y} \Psi^S_\theta(x_y, x_Z)\, \mu_{y\to f}(x_y) \prod_{u\in Z\setminus z} \mu_{u\to f}(x_u) \tag{5.33}$$
where $\sum_{x_{Z\setminus z}}$ is a summation over all configurations of $x_{Z\setminus z}$. Consider the case $x_z = 0$. The potential is non-zero only when $x_y = 0$ and $c(x_{Z\setminus z}) = 0$, or when $x_y = 1$ and $c(x_{Z\setminus z}) = 1$:
$$\mu_{f\to z}(0) \propto \mu_{y\to f}(0) \prod_{u\in Z\setminus z} \mu_{u\to f}(0) + \mu_{y\to f}(1) \sum_{v\in Z\setminus z} \theta_v\, \mu_{v\to f}(1) \prod_{u\in Z\setminus\{z,v\}} \mu_{u\to f}(0)$$
$$= \Big(\prod_{u\in Z\setminus z} \mu_{u\to f}(0)\Big)\Big(\mu_{y\to f}(0) + \mu_{y\to f}(1) \sum_{v\in Z\setminus z} \theta_v \frac{\mu_{v\to f}(1)}{\mu_{v\to f}(0)}\Big). \tag{5.38}$$
Consider the case $x_z = 1$: the potential is non-zero only when $x_y = 1$ and $c(x_{Z\setminus z}) = 0$, so
$$\mu_{f\to z}(1) \propto \theta_z\, \mu_{y\to f}(1) \prod_{u\in Z\setminus z} \mu_{u\to f}(0). \tag{5.40}$$

Proof of Theorem 25 From Lemma 26, the messages $\mu_{f\to y}(x_y)$, $x_y \in \{0, 1\}$, can be computed in $O(|Z|)$ time, since they only require computing a sum and a product over simple quantities for each element of $Z$. From Lemma 27, the quantities $\mu_{f\to z}(0)$ $\forall z \in Z$ can be computed jointly in $O(|Z|)$ time by applying Observation 1. The quantities $\mu_{f\to z}(1)$ $\forall z \in Z$ can also be computed jointly in $O(|Z|)$ time by applying Observation 1 to compute $\prod_{u\in Z\setminus z} \mu_{u\to f}(0)$, and applying Observation 1 in the log domain to compute $\sum_{v\in Z\setminus z} \theta_v \frac{\mu_{v\to f}(1)}{\mu_{v\to f}(0)}$. Therefore, all messages from a Selection factor node to its neighbouring variable nodes can be computed in $O(|Z|)$ time. Noting that the degree of the Selection factor node is $|Z| + 1$ completes the proof.
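A companion sketch for the Selection factor follows; it assumes strictly positive incoming messages, as stated above, and computes the leave-one-out sums by subtraction from the total, an equivalent linear-time alternative to working in the log domain.

```python
import numpy as np

def _leave_one_out_prod(v):
    """All products prod_{j != i} v[j] in linear time (prefix/suffix trick)."""
    pre = np.concatenate(([1.0], np.cumprod(v)[:-1]))
    suf = np.concatenate((np.cumprod(v[::-1])[-2::-1], [1.0]))
    return pre * suf

def selection_messages(mu_y, mu_in_z, theta):
    """All outgoing messages of a Selection factor in O(|Z|) time (Lemmas 26-27).
    mu_y: (mu_{y->f}(0), mu_{y->f}(1)); mu_in_z: shape (n, 2); theta: shape (n,)."""
    z0, z1 = mu_in_z[:, 0], mu_in_z[:, 1]
    ratio = theta * z1 / z0
    # Message to the input y (Lemma 26).
    p0 = np.prod(z0)
    to_y = np.array([p0, p0 * ratio.sum()])
    # Messages to each output z (Lemma 27), via leave-one-out products and sums.
    loo_prod = _leave_one_out_prod(z0)
    loo_sum = ratio.sum() - ratio
    to_z0 = loo_prod * (mu_y[0] + mu_y[1] * loo_sum)
    to_z1 = theta * mu_y[1] * loo_prod
    norm = to_z0 + to_z1
    return to_y / to_y.sum(), np.stack([to_z0 / norm, to_z1 / norm], axis=1)
```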
5.2.3 Message passing for Berns factors

Recall the definition of the Berns potential in Definition 14, and the general message passing equation from factor nodes to variable nodes given in Eqn. 5.2. We will show that exploiting the structure of the Berns potential allows one to compute all messages from a Berns factor node to its neighbouring variable nodes in time linear in the degree of the factor node. The Berns factor can be expressed as a product of pairwise factors connecting the input binary variable and one of the output binary variables. Since the Berns factor can be expressed as a product of factors, it is intuitive that message computation can be performed in time linear in the degree of the Berns potential. For completeness, we prove this result and derive the message passing equations from a Berns factor to its neighbouring variable nodes.

Below, let $\Psi^B_\theta$ be a Berns potential and let $f$ be the corresponding Berns factor node. As illustrated in Figure 4.1(c), the factor has neighbouring nodes $N(f) = Z \cup \{y\}$. We assume that $\mu_{u\to f}(x_u) > 0$ and $\sum_{x_u} \mu_{u\to f}(x_u) = 1$, $\forall u \in N(f)$ and $x_u \in \{0, 1\}$.

Theorem 28 All messages from a Berns factor node to all of its neighbouring variable nodes can be computed in time linear in the degree of the factor node.

To prove Theorem 28, we require two lemmas.

Lemma 29 The messages $\mu_{f\to y}(x_y)$ can be expressed as
$$\mu_{f\to y}(0) \propto \prod_{z\in Z} \mu_{z\to f}(0) \tag{5.41}$$
$$\mu_{f\to y}(1) \propto \prod_{z\in Z} \big((1 - \theta_z)\mu_{z\to f}(0) + \theta_z \mu_{z\to f}(1)\big) \tag{5.42}$$
where the constants of proportionality are chosen so that $\mu_{f\to y}(0) + \mu_{f\to y}(1) = 1$.

Lemma 30 The messages $\mu_{f\to z}(x_z)$ $\forall z \in Z$ can be expressed as
$$\mu_{f\to z}(0) \propto \mu_{y\to f}(0) \prod_{u\in Z\setminus z} \mu_{u\to f}(0) + (1 - \theta_z)\,\mu_{y\to f}(1) \prod_{u\in Z\setminus z} \big((1 - \theta_u)\mu_{u\to f}(0) + \theta_u \mu_{u\to f}(1)\big) \tag{5.43}$$
$$\mu_{f\to z}(1) \propto \theta_z\, \mu_{y\to f}(1) \prod_{u\in Z\setminus z} \big((1 - \theta_u)\mu_{u\to f}(0) + \theta_u \mu_{u\to f}(1)\big) \tag{5.44}$$
where the constants of proportionality are chosen so that $\mu_{f\to z}(0) + \mu_{f\to z}(1) = 1$.

Proof of Lemma 29 Substituting the form of the Berns potential defined in Definition 14 into the general message passing equation given in Eqn. 5.2, and noting that the quantity of interest is $\mu_{f\to y}(x_y)$, yields:
$$\mu_{f\to y}(x_y) \propto \sum_{x_Z} \Psi^B_\theta(x_y, x_Z) \prod_{u\in Z} \mu_{u\to f}(x_u) \tag{5.45}$$
where the summation is over all possible configurations of $x_Z$. Consider the case $x_y = 0$: only the configuration with $c(x_Z) = 0$ has non-zero potential, so
$$\mu_{f\to y}(0) \propto \prod_{z\in Z} \mu_{z\to f}(0). \tag{5.47}$$
Consider the case $x_y = 1$: the potential factors over the outputs, so
$$\mu_{f\to y}(1) \propto \sum_{x_Z} \prod_{z\in Z} \theta_z^{x_z}(1 - \theta_z)^{1 - x_z}\, \mu_{z\to f}(x_z) = \prod_{z\in Z} \big((1 - \theta_z)\mu_{z\to f}(0) + \theta_z \mu_{z\to f}(1)\big). \tag{5.50}$$

Proof of Lemma 30 Substituting the form of the Berns potential into Eqn. 5.2, and noting that the quantity of interest is $\mu_{f\to z}(x_z)$, $z \in Z$, yields:
$$\mu_{f\to z}(x_z) \propto \sum_{x_{Z\setminus z}} \sum_{x_y} \Psi^B_\theta(x_y, x_Z)\, \mu_{y\to f}(x_y) \prod_{u\in Z\setminus z} \mu_{u\to f}(x_u) \tag{5.51}$$
where $\sum_{x_{Z\setminus z}}$ is a summation over all configurations of $x_{Z\setminus z}$. Consider the case $x_z = 0$. The $x_y = 0$ terms are non-zero only when $c(x_{Z\setminus z}) = 0$, while the $x_y = 1$ terms factor over the outputs and carry a factor $(1 - \theta_z)$ for the output $z$:
$$\mu_{f\to z}(0) \propto \mu_{y\to f}(0) \prod_{u\in Z\setminus z} \mu_{u\to f}(0) + (1 - \theta_z)\,\mu_{y\to f}(1) \prod_{u\in Z\setminus z} \big((1 - \theta_u)\mu_{u\to f}(0) + \theta_u \mu_{u\to f}(1)\big). \tag{5.53}$$
Consider the case $x_z = 1$: the $x_y = 0$ terms vanish, and the $x_y = 1$ terms carry a factor $\theta_z$:
$$\mu_{f\to z}(1) \propto \theta_z\, \mu_{y\to f}(1) \prod_{u\in Z\setminus z} \big((1 - \theta_u)\mu_{u\to f}(0) + \theta_u \mu_{u\to f}(1)\big). \tag{5.55}$$

Proof of Theorem 28 From Lemma 29, the messages $\mu_{f\to y}(x_y)$, $x_y \in \{0, 1\}$, can be computed in time $O(|Z|)$, since they only require computing a product over easily computable quantities for each element of $Z$. From Lemma 30, the quantities $\mu_{f\to z}(x_z)$ $\forall z \in Z$, $x_z \in \{0, 1\}$, can be computed jointly in $O(|Z|)$ time by applying Observation 1. Therefore, all messages from a Berns factor node to its neighbouring variable nodes can be computed in $O(|Z|)$ time. Noting that the degree of the Berns factor node is $|Z| + 1$ completes the proof.
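The Berns messages admit the same linear-time treatment; the sketch below reuses the `_leave_one_out_prod` helper (and the `numpy` import) from the Selection sketch above.

```python
def berns_messages(mu_y, mu_in_z, theta):
    """All outgoing messages of a Berns factor in O(|Z|) time (Lemmas 29-30).
    Reuses _leave_one_out_prod and numpy (as np) from the Selection sketch."""
    z0, z1 = mu_in_z[:, 0], mu_in_z[:, 1]
    mix = (1.0 - theta) * z0 + theta * z1            # per-output mixture terms
    to_y = np.array([np.prod(z0), np.prod(mix)])     # Lemma 29
    loo_z0, loo_mix = _leave_one_out_prod(z0), _leave_one_out_prod(mix)
    to_z0 = mu_y[0] * loo_z0 + (1.0 - theta) * mu_y[1] * loo_mix
    to_z1 = theta * mu_y[1] * loo_mix                # Lemma 30
    norm = to_z0 + to_z1
    return to_y / to_y.sum(), np.stack([to_z0 / norm, to_z1 / norm], axis=1)
```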
5.2.4 Proof of Theorem 20

Proof The result follows directly from Theorems 21, 22, 25, and 28.

5.3 Markov Chain Monte Carlo as an alternative to LBP

In this thesis, we are mainly concerned with computing marginal probabilities, such as the probability of the presence/absence of bricks in a given scene. As pointed out in Chapter 1, Markov Chain Monte Carlo (MCMC) techniques have been successfully used for inference in other grammar-based frameworks. A natural question is whether MCMC is a feasible alternative to LBP for inference in the PSG framework.

For the factor graphs we consider, simple MCMC schemes such as Gibbs sampling ([20]) and Metropolis-Hastings ([25]) are impractical. The main issue with applying Gibbs sampling or Metropolis-Hastings here is that the factor graph under consideration may contain on the order of millions to billions of tightly coupled random variables. Because of the tight coupling, it is extremely difficult to design good move proposals for Metropolis-Hastings, and Gibbs sampling may be unable to mix, since sampling a random variable from its conditional distribution is likely to result in no change in state. The crux of the issue is that, due to the tight coupling of random variables, it is difficult to design a transition kernel that defines an irreducible Markov chain yet also has a reasonable probability that proposed moves will be accepted. The sheer size of the factor graph exacerbates the problem of the MCMC chain mixing.

Consider using Metropolis-Hastings MCMC as the inference engine for the toy grammar listed in Grammar 4, and consider the following situation. Let $A = \{(B, 3), (C, 3), (D, 3)\}$ be a set of bricks. Consider the setting $X(a) = 1$, $a \in A$, and $X(a') = 0$ $\forall a' \notin A$. In this setting, the random variable $X(C, 3)$ can have value 1 only if the brick $(B, 3)$ generated it. Similarly, the random variable $X(D, 3)$ can have value 1 only if the brick $(C, 3)$ generated it. Now, consider proposing a new value for the random variable $X(C, 3)$. Its state cannot be set to 0, since the brick $(B, 3)$ would then have no generated bricks, which is impossible under the model; the brick $(D, 3)$ would also have no parent, which is likewise impossible under the model. Similarly, it is not possible for Metropolis-Hastings to propose a different state for $X(B, 3)$ or $X(D, 3)$; Metropolis-Hastings is “stuck”. As we have demonstrated, for inference in the factor graph representing the model described in Grammar 4, serial MCMC techniques are too slow to be practical and will suffer from getting “stuck” in a particular state.
To alleviate these difficulties, one may try block-move schemes such as block Gibbs sampling and block Metropolis-Hastings, whereby a large joint move involving the random variables of multiple bricks is considered. However, the tight coupling of random variables still poses a problem, since for the factor graphs we consider most joint assignments to random variables will be extremely unlikely or invalid. One could imagine designing effective model-specific block sampling schemes. But, as we are interested in a general, problem-agnostic framework for scene understanding, we wish to avoid inference schemes that are sensitive to the problem under consideration. It may be possible to use more sophisticated MCMC methods, such as Slice Sampling ([43]), which was effective in the Picture framework ([34]), Hybrid Monte Carlo ([44]), or a combination of MCMC methods. Such an approach may be viable and is an interesting direction for future research.

Chapter 6
Example grammars: Inference with LBP

In this chapter we examine the results obtained by running LBP on the factor graph representation of the example grammars in Chapter 3. In the examples in this chapter, we condition on the presence/absence of a set of bricks in the scene and run LBP to convergence. We then compute approximate marginals for each unconditioned brick using Eqn. 5.3, as in the sketch following Figure 6.1.

6.1 LBP computations with a curve grammar

In this section we demonstrate the ability of the PSG framework to perform contour completion using Grammar 1. Figure 6.1 shows two examples of contour completions. In these experiments we condition on the presence of some INK bricks (shown in red) and compute the approximate marginal probabilities of the remaining INK bricks being present in the scene using LBP. As shown in the figure, the PSG framework is capable of completing gaps in contours.

Note that in both contour completions there is uncertainty in the precise completions. In the example in Figure 6.1(a), there is uncertainty as to whether the contour should be straight or if there are slight deviations along the contour in the vertical direction. Also, the PSG framework places non-trivial probability mass on the event that the contour continues on either side. In the example in Figure 6.1(b), the model captures uncertainty in the precise completion of the contour(s). Also, the PSG framework places little probability mass on the event that observed contours intersect and extend past the point of intersection. Instead, the PSG framework estimates that it is more likely that each observed contour “turns” into a neighbouring contour via a change of orientation. As these two examples demonstrate, the PSG framework uses some notion of context to perform contour completion.

Figure 6.1: A visualization of two contour completion examples. Each pixel represents an INK brick present at that location. The INK bricks conditioned to be present in the scene are denoted by a red pixel; all other bricks in the scene are unconditioned. The gray-scale values show the resulting approximate marginal probabilities computed by LBP. Darker pixels indicate a higher approximate marginal probability for the corresponding INK brick to be present in the scene. (a) Contour gap completions. Note that the PSG framework expresses variability in the precise gap completions. Also, there is some uncertainty as to whether the contour extends to the left and to the right. (b) Completion of several contour gaps. Here, the PSG framework is capable of filling in gaps between observed contours around plausible intersection points. As with the example in Figure 6.1(a), there is variability over the precise completions. See text for discussion.
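For reference, the marginal read-out used throughout this chapter is assumed to be the standard LBP belief, i.e., the normalized product of a variable's converged incoming messages (what Eqn. 5.3 states). A minimal sketch, with illustrative names:

```python
import numpy as np

def approx_marginal(incoming):
    """Approximate marginal of a binary variable from its converged
    factor-to-variable messages. `incoming` is an iterable of
    length-2 arrays (mu(0), mu(1))."""
    b = np.ones(2)
    for mu in incoming:
        b = b * np.asarray(mu)
    return b / b.sum()
```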
6.2 LBP computations with a face grammar

In this section we demonstrate the ability of the PSG framework to perform face and face part localization using Grammar 2. Note that the grammar forms a two-level hierarchy in which FACE bricks generate EYE, NOSE, and MOUTH bricks. So, one can talk about top-down contextual information in the sense of a FACE brick providing context for EYE, NOSE, and MOUTH bricks, and bottom-up contextual information in the sense of EYE, NOSE, and MOUTH bricks providing context for a FACE brick. Figure 6.2 shows the results of inference after conditioning on the presence of three different sets of bricks in the scene.

Figure 6.2: A visualization of face and face part localization in various contexts. Each row is a different example where a different set of bricks was conditioned to be present in the scene. Each column represents a symbol of the grammar (FACE, EYE, NOSE, MOUTH). Each pixel represents a brick at that pixel location. The bricks conditioned to be present in the image are denoted by a red pixel with a red arrow pointing to them. All other bricks in the scene are unconditioned. The gray-scale values are a visualization of the resulting approximate marginal probabilities of a brick being present in the scene as computed by LBP. Darker pixels indicate a higher approximate marginal probability to be present. For visualization purposes, a non-linear monotonic transformation was applied to the approximate marginal probabilities to enhance contrast. See text for an analysis of the results of inference.

Row 1 of Figure 6.2 shows the results of LBP when the FACE brick at the centre of the scene is conditioned to be present. The PSG framework performs top-down reasoning and posits possible locations for the face parts. Since this grammar allows for variability in the location of the face parts, there is a region of plausible locations for each part.

Row 2 of Figure 6.2 shows the results of LBP when the EYE brick at the centre of the scene is conditioned to be present. Here, the PSG framework performs both bottom-up and top-down reasoning. First, since an EYE seldom appears on its own, the PSG framework infers that it is likely that there is a FACE present in the scene. This is an example of bottom-up reasoning. Due to the variability in the relative poses between a FACE and a constituent EYE, the framework is uncertain of the precise location of the FACE. Moreover, the distribution over possible FACE poses is bimodal, since an EYE can be either the left eye or the right eye, so the framework reasons about both possibilities. For each FACE brick that may be present in the scene, the framework reasons about all of its constituent parts; hence the distributions over NOSE and MOUTH bricks are bimodal as well. This is an example of top-down reasoning. Note that deducing that there is likely to be a NOSE or MOUTH in the scene based solely on observing an EYE requires a chain of reasoning that goes bottom-up, to infer the presence of a FACE, and then top-down, to infer the presence of the other FACE parts.

Row 3 of Figure 6.2 shows the results of LBP when two EYE bricks that can both be generated by a single FACE brick are conditioned to be present in the scene.
Note that the approximate marginal distribution of the FACE bricks present in the scene is trimodal. The two smaller modes correspond to the possibility that each EYE brick was generated by a different FACE brick, and so there may be two FACE bricks present in the scene. The centre mode corresponds to the possibility that both EYE bricks were generated by the same FACE brick. The face grammar used here encodes that the presence of any particular FACE brick in a scene is rare, through a low self-rooting probability for FACE bricks. Because of this, the PSG framework places higher probability on the event that there is a single FACE brick present in the scene with both conditioned-on EYE bricks as constituent parts, and lower probability on the event that there are two FACE bricks present in the scene with each EYE brick generated by a different FACE brick.

The three examples in Figure 6.2 showcase the ability of the PSG framework to simultaneously perform top-down and bottom-up reasoning. Both kinds of reasoning are crucial to capture a notion of context. Note that there is no explicit notion of top-down or bottom-up here; such a concept emerges naturally from the hierarchical structure of Grammar 2. We argue that the ability to reason contextually in a top-down and bottom-up fashion is crucial if one wishes to capture the notion of context and leverage it in scene understanding tasks. Here, we have demonstrated the ability of the PSG framework to capture a notion of context.

6.3 LBP computations with a binary segmentation grammar

In this section, we analyze the capability of the PSG framework to reason about binary image segments using Grammar 3. Figure 6.3 shows the results of inference on three example scenes.

Figure 6.3: A visualization of three binary image segmentation examples, (a)-(c). Each pixel represents an FG brick at that location. FG bricks conditioned to be present in the scene are denoted by a red pixel. FG bricks conditioned to be absent from the scene are denoted by a blue pixel. All other bricks in the scene are unconditioned. The gray-scale values show the resulting approximate marginal probabilities computed by LBP. Darker pixels indicate a higher approximate marginal probability to be present in the scene. See text for discussion. Best viewed in colour.

In Figure 6.3(a), we condition on the presence of the FG brick in the centre of the scene. Its presence influences other FG bricks near it to be present in the scene as well, with its influence decreasing with distance. The most probable interpretation of the scene under this grammar is that the SEED brick chose the FG brick indicated by the red pixel to be present, and the chosen FG brick potentially generates other FG bricks. However, another possible scene interpretation is that the SEED brick chose some other FG brick to be present, and, through the generative process, the FG brick indicated by the red pixel was generated. The latter event has many ways to occur, since the SEED brick could have chosen any other FG brick to start the generative process. The PSG framework reasons about both of these scene interpretations.

In Figure 6.3(b), we condition on a set of FG bricks being present in the scene. The FG bricks that are conditioned to be present form the boundary of a square in the scene. Here, there are many possible scene interpretations, since the SEED brick could have chosen any of the FG bricks conditioned to be present to start the generative process.
The results of inference suggest that there is a high probability that the FG bricks inside the square are present. Outside the square, the approximate marginal probabilities computed by LBP decay rapidly with distance from the square's boundary. Note that the approximate marginal probability for the FG brick at the centre of the square is higher than for most of the FG bricks outside of the square. In fact, the FG bricks outside the square and more than 2 pixels from the square's boundary have a lower approximate marginal probability of being present in the scene than the centre FG brick, despite the centre FG brick being 16 pixels away from the square's boundary. This suggests that the PSG framework is capable of “filling in” shapes.

In Figure 6.3(c), we condition on a set of FG bricks being present in the scene, and condition on another set of FG bricks being absent from the scene. Both sets of FG bricks form squares in the scene. Note that this example is similar to the example shown in Figure 6.3(b), except that now we also condition on the absence of a set of bricks. Recall that Grammar 3 generates scenes with a single, non-empty, connected foreground component. So, under the distribution over scenes induced by Grammar 3, it is impossible for FG bricks outside the blue square to be present in the scene, since some FG bricks inside the blue square must be present. The approximate marginals estimated by LBP reflect this fact.

However, LBP is not always able to reason correctly about such global constraints. Figure 6.4 shows another run of LBP using the same set of conditioned-on bricks as the example in Figure 6.3(c). Here, LBP produces very different results. Recall from Chapter 5 that LBP requires an initialization of messages. In the example shown in Figure 6.4, the messages of LBP were initialized to favour the presence of all unconditioned FG bricks in the scene. In contrast, in the example shown in Figure 6.3(c), the messages of LBP were initialized to discourage the presence of any unconditioned FG bricks in the scene. In Figure 6.4, the approximate marginals computed by LBP are inconsistent with the distribution over scenes induced by Grammar 3, since under that distribution it is impossible for any FG bricks outside of the blue square to be present.

There are three causes that contribute to the mismatch between the approximate marginals produced by LBP and the distribution over scenes induced by Grammar 3 in the example shown in Figure 6.4. The first issue is numerical. In our implementation of LBP, messages are constrained to be non-zero to avoid numerical issues. If one uses Eqn. 5.3 to compute approximate marginal probabilities, then it is impossible for a marginal probability to be exactly 0, since that would require at least one of the LBP messages to be exactly 0. Hence, in our implementation of LBP, we cannot capture the notion that it is impossible for some set of bricks to be present in the scene.

The second issue is that the distribution over scenes represented by the factor graph constructed from Grammar 3 using Definition 16 is different from the distribution over scenes represented by the original grammar. The factor graph construction assumes an acyclic grammar, but Grammar 3 is cyclic, and hence the distributions over scenes are not the same. Since we perform inference using the constructed factor graph, we are performing inference with a distribution over scenes that is related to, but different from, the one induced by Grammar 3.
Figure 6.4: A visualization of a binary image segmentation task where the approximate marginals computed by LBP are inconsistent with the underlying grammar model. The gray-scale values show the resulting approximate marginal probabilities computed by LBP. Darker pixels indicate a higher approximate marginal probability to be present in the scene. The set of FG bricks conditioned on is the same as in Figure 6.3(c), but here the messages of LBP have been initialized to favour the presence of all FG bricks in the scene. Note that, as in Figure 6.3(c), according to the generative process it is impossible for any FG brick outside of the blue square to be present. However, in this case LBP reasons incorrectly about this constraint.

The third issue is that LBP is a heuristic for performing approximate inference in loopy factor graphs. Although in practice LBP seems to produce reasonable approximations to marginal quantities for many tasks (see [42, 33, 14]), there is no guarantee that it will work well for arbitrary tasks and factor graphs. Empirically, LBP has difficulty producing accurate approximations when the underlying factor graph contains many loops, as is the case here.

Chapter 7
Connections to Pictorial Structures

In this chapter, we elucidate the connections between the PSG framework and the Pictorial Structures models described in [13]. In particular, we show that the prior over scenes that a Pictorial Structure (PS) model defines can be expressed as a PSG as described in Chapter 2, but the reverse is not true. Thus, the PSG representation can be viewed as a generalization of the PS model representation. Also, the graphical model representation of a PS model differs significantly from the factor graph representation used in the PSG framework. The difference in graphical model representation has consequences for the accuracy and speed of inference. Namely, the PS graphical model allows for efficient and exact maximum a posteriori (MAP) estimation via dynamic programming and generalized distance transforms. Recall from Chapter 5 that, in contrast, the PSG framework employs the approximate inference scheme of LBP, which in practice is slower than the exact inference scheme used in the PS framework.

7.1 Pictorial Structures: Overview

We first describe the PS model, as presented in [13]. A PS model represents objects as a collection of parts and connections between parts. A PS model can be represented as an undirected graph $G = (V, E)$, where $V = \{v_1, \dots, v_n\}$ represents a set of objects/parts, and $E$ is a set of pairs $\{v_i, v_j\}$. A configuration of parts is given by $L = \{l_1, \dots, l_n\}$, where $l_i$ specifies the pose of part $v_i$ in the scene. Poses may correspond to pixel coordinates, for example. Importantly, a PS model implicitly assumes that there is one instance of each object in the scene, since $l_i$ represents a single pose for part $v_i$.

To model the geometric relationship between parts, a PS model has pairwise terms $d_{ij}(l_i, l_j)$ that measure the degree of disagreement between the placement of parts $v_i$ and $v_j$ at locations $l_i$ and $l_j$, respectively. Finally, given an image $Y$, the term $m_i(l_i, Y)$ is a cost for placing the object $v_i$ at location $l_i$ based on the image evidence. The energy of a configuration $L$ given an image $Y$ is defined to be

$$F(L, Y) = \sum_{\{v_i, v_j\}\in E} d_{ij}(l_i, l_j) + \sum_{i=1}^n m_i(l_i, Y). \tag{7.1}$$

The energy in Eqn. 7.1 defines a probability

$$p(L, Y) = \frac{1}{Z} \Big(\prod_{\{v_i, v_j\}\in E} e^{-d_{ij}(l_i, l_j)}\Big) \Big(\prod_{i=1}^n e^{-m_i(l_i, Y)}\Big). \tag{7.2}$$
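As a concrete reading of Eqns. 7.1 and 7.2, the following sketch evaluates the PS energy of a configuration; the cost callables `d` and `m` are placeholders for whatever deformation and appearance costs a particular model uses.

```python
import math

def ps_energy(L, Y, edges, d, m):
    """Energy of a Pictorial Structures configuration (Eqn. 7.1). L maps each
    part index to its pose; edges is a list of index pairs (i, j); d and m are
    stand-ins for a model's pairwise deformation and unary appearance costs."""
    pairwise = sum(d(i, j, L[i], L[j]) for (i, j) in edges)
    unary = sum(m(i, L[i], Y) for i in L)
    return pairwise + unary

# The corresponding unnormalized probability of Eqn. 7.2:
# p(L, Y) is proportional to math.exp(-ps_energy(L, Y, edges, d, m))
```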
Eqn. 7.2 can be viewed as a conditional probability of $L$ given $Y$ defined by a product of a prior over scenes, $p(L)$, and an image likelihood $p(Y \mid L)$, where

$$p(L) \propto \prod_{\{v_i, v_j\}\in E} e^{-d_{ij}(l_i, l_j)} \tag{7.3}$$
$$p(Y \mid L) \propto \prod_{i=1}^n e^{-m_i(l_i, Y)}. \tag{7.4}$$

In [13], inference is performed via maximum a posteriori (MAP) estimation by minimizing Eqn. 7.1. For efficient inference, $G = (V, E)$ is constrained to be a tree, and the $d_{ij}$ must be expressible as a Mahalanobis distance between locations in a transformed space. These constraints allow for exact MAP estimation using dynamic programming and generalized distance transforms. [13] also discusses the computation of marginal distributions. Here, we focus on the MAP setting, but the connections outlined in this chapter are equally applicable to the setting where marginals are computed.

7.2 Expressing a Pictorial Structures model as a PSG

In this section, we show how to represent the prior over scenes defined by a PS model, $p(L)$, in the language of a PSG as in Definition 1. We describe the construction of a PSG from an arbitrary PS model below. Since a PS model is restricted to be tree-structured, without loss of generality we take the root of the tree to be $v_1$ and assume that the $v_i$ are ordered by a breadth-first search starting at $v_1$.

Definition 31 Construction process for transforming a PS model into a PSG:
1. Represent each object/part $v_i \in V$ in the PS model as a symbol in the PSG. For simplicity, we use $v_i$, $1 \le i \le n$, as symbols in the PSG.
2. Let $L_i$ be the set of possible values for $l_i$, $1 \le i \le n$. Define the pose space of $v_i$ to be $L_i$.
3. For each $v_i$, create one production rule with $v_i$ on the LHS, and include $v_j$ in the RHS if $\{i, j\} \in E$ and $j > i$. Assign this rule probability 1. Since each symbol $v_i$ appears only once on the LHS of the set of production rules, without loss of generality assume rule $r$ has $v_r$ on the LHS.
4. Set all conditional pose distributions to be Categorical distributions. For all pairs $\{v_i, v_j\}$ where $\{v_i, v_j\} \in E$ and $j > i$, set the parameter $\theta_{(l_i, i, k, l_j)} \propto e^{-d_{ij}(l_i, l_j)}$ for $l_i \in L_i$ and $l_j \in L_j$. The constant of proportionality is chosen so that the set of parameters $\theta_{(l_i, i, k)}$ sums to 1.
5. Set $\epsilon_{v_i} = 0$, $1 \le i \le n$.
6. Introduce a symbol $v_{\text{SEED}}$. This symbol will be used to ensure that there is exactly one instance of $v_1$ in the scene.
7. Set the pose space of $v_{\text{SEED}}$ to be $[1]$.
8. Set $\epsilon_{v_{\text{SEED}}} = 1$.
9. Create a production rule with $v_{\text{SEED}}$ on the LHS and only $v_1$ on the RHS. Without loss of generality, assume this is the $(n+1)$-th rule.
10. Set $\gamma_{(n+1,1)}$ to be a Uniform distribution over the elements of $L_1$.
11. Set all rule probabilities to 1.

A schematic implementation of this construction is sketched at the end of this section. Following the construction above yields a PSG that defines a prior over scenes matching an arbitrary PS model's prior over scenes. Although the prior over scenes that a PS model defines can be represented as a PSG, the reverse is not true. Below are several aspects of a PSG model that cannot be expressed in a PS model.
• A PSG can have multiple instances of each object in the scene.
• The conditional pose distributions in a PSG can be an IndBern distribution.
• A PSG can have multiple possible compositions for a given object, with each composition having a different probability of occurring.
• The grammar of a PSG need not be tree-structured.

Since a PS model's prior over scenes, $p(L)$, can be represented as a PSG but the reverse is not true, we say that the PSG representation described in Chapter 2 is a generalization of the PS model representation.
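The sketch below carries out the construction of Definition 31 on a schematic data layout (plain dictionaries and lists); it is an illustration of the construction, not the thesis' implementation, and it assumes $V$ is given in breadth-first order from the root $v_1$.

```python
import math

def _normalize(w):
    s = sum(w.values())
    return {k: v / s for k, v in w.items()}

def ps_to_psg(V, E, L, d):
    """Build a PSG from a tree-structured PS model (Definition 31). V lists the
    parts in BFS order from the root; E is a set of frozenset edges; L maps a
    part to its pose list; d(i, j, li, lj) is the pairwise cost."""
    symbols = list(V) + ["SEED"]
    pose_space = {**L, "SEED": [0]}                     # step 7: Omega_SEED = [1]
    eps = {v: 0.0 for v in V}                           # steps 5 and 8
    eps["SEED"] = 1.0
    rules = []
    for i, vi in enumerate(V):                          # step 3: one rule per part
        children = []
        for j, vj in enumerate(V):
            if j > i and frozenset((vi, vj)) in E:
                # Step 4: Categorical pose distribution, theta proportional
                # to exp(-d_ij(li, lj)), normalized per parent pose li.
                theta = {li: _normalize({lj: math.exp(-d(i, j, li, lj))
                                         for lj in L[vj]}) for li in L[vi]}
                children.append((vj, theta))
        rules.append((1.0, vi, children))               # step 11: probability 1
    root = V[0]                                         # steps 9-10: SEED -> v1
    uniform = {0: {l: 1.0 / len(L[root]) for l in L[root]}}
    rules.append((1.0, "SEED", [(root, uniform)]))
    return symbols, pose_space, rules, eps
```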
7.2.1 Example construction

In this subsection, we give a concrete example of using the construction described in Definition 31 to represent a PS model as a PSG.

Figure 7.1: Undirected graphical model representation for a PS model where $V = \{v_1, \dots, v_5\}$.

Figure 7.1 shows the undirected graphical model for a PS model where $V = \{v_1, \dots, v_5\}$ and the state space of $v_i$ is $L_i$. Following the construction process described in Definition 31, the corresponding PSG is given in Grammar 5.

Grammar 5 A PSG representation of the PS model in Figure 7.1: $\Sigma = \{v_1, \dots, v_5, v_{\text{SEED}}\}$. $\Omega_{v_i} = L_i$, $1 \le i \le 5$. $\Omega_{v_{\text{SEED}}} = [1]$.
Rules:
1.0, $(v_1, l_1)$ → $(v_2, \text{Categorical}(\cdot \mid \theta_{(l_1,1,1)}))$, $(v_3, \text{Categorical}(\cdot \mid \theta_{(l_1,1,2)}))$, $(v_4, \text{Categorical}(\cdot \mid \theta_{(l_1,1,3)}))$
1.0, $(v_2, l_2)$ → ∅
1.0, $(v_3, l_3)$ → ∅
1.0, $(v_4, l_4)$ → $(v_5, \text{Categorical}(\cdot \mid \theta_{(l_4,4,1)}))$
1.0, $(v_5, l_5)$ → ∅
1.0, $(v_{\text{SEED}}, 1)$ → $(v_1, \text{Uniform}(L_1))$
$\epsilon_{v_i} = 0$, $1 \le i \le 5$; $\epsilon_{v_{\text{SEED}}} = 1$.

Grammar 5 induces a distribution over scenes where each object/part $v_i \in \{v_1, \dots, v_5\}$ appears exactly once in the scene, and the probability of placing $v_j$ at $l_j$, given that object $v_i$ is placed at $l_i$, is proportional to $e^{-d_{ij}(l_i, l_j)}$ for $\{v_i, v_j\} \in E$ and $j > i$. The prior over scenes induced by the PSG in Grammar 5 matches the prior over scenes induced by the PS model depicted in Figure 7.1.

7.3 Pictorial Structures vs. PSG: graphical models and inference

Inference in the PS framework differs substantially from inference in the PSG framework. The differences in inference can be attributed to the underlying graphical model representations. For a PS model, the graphical model representation is a tree-structured Markov random field (MRF). Examining Eqn. 7.1, one takes the $l_i$, $1 \le i \le n$, to be the random variables of the MRF, and the terms $e^{-d_{ij}(l_i, l_j)}$ and $e^{-m_i(l_i, Y)}$ correspond to the pairwise and unary potentials, respectively.

This formulation is significantly different from the PSG factor graph. The key differences are that 1) the PS graphical model is tree-structured while the PSG graphical model is not, and 2) the PS graphical model typically has a few random variables with large state spaces, while the PSG graphical model typically has a large number of binary random variables. These differences have implications for both the accuracy and speed of inference. For PS, MAP estimation with its graphical model is exact and fast, since one can use dynamic programming and generalized distance transforms. On the other hand, if one were to perform MAP estimation with the PSG factor graph¹, inference would be only approximate and slower than inference in the PS framework, since the PSG framework uses LBP in a loopy graph.

As shown in Section 7.2, a PS model can be expressed in the PSG framework, but the reverse is not true; the set of models expressible in the PS framework is a proper subset of the models expressible in the PSG framework. However, the cost of the additional modeling power of the PSG framework is that inference is only approximate and slower than in the PS framework.

¹ Although we use LBP to compute marginals as stated in Chapter 5, we could use max-product LBP to perform inference in a MAP setting.

Chapter 8
Learning Model Parameters

Recall from Definition 1 that a PSG is defined by a 6-tuple $G = (\Sigma, \Omega, R, q, \gamma, \epsilon)$. The chief learning problem we consider in this work is estimating the model parameters $q$, $\gamma$, and $\epsilon$.
We first formalize the parameter estimation problem, briefly describe the Expectation-Maximization (EM) algorithm, and finally describe a modification to the EM algorithm so that it can be applied in the PSG framework. The modification is to replace the exact posterior quantities required by the Maximization-step of the EM algorithm with approximate posterior quantities computed by LBP. The general idea of replacing the exact posteriors by approximate posteriors computed by LBP is related to the work of [26]. The learning algorithm described in this chapter can be thought of as an approximate variant of the EM algorithm.

8.1 Maximum likelihood estimation

Consider a set of $n$ datapoints $D = \{D_1, \dots, D_n\}$ that are independent and identically distributed. Let $\Phi$ be the set of model parameters. The data likelihood function is

$$p(D \mid \Phi) = \prod_{i=1}^n p(D_i \mid \Phi). \tag{8.1}$$

In the maximum likelihood setting, the goal is to find a setting of $\Phi$ that maximizes the data likelihood, or equivalently the data log-likelihood. Formally, we seek to solve

$$\Phi^* = \arg\max_\Phi \log p(D \mid \Phi) \tag{8.2}$$
$$= \arg\max_\Phi \sum_{i=1}^n \log p(D_i \mid \Phi). \tag{8.3}$$

Suppose that the probabilistic model under consideration has hidden variables $Z = \{Z_1, \dots, Z_n\}$, where $Z_i$ is the set of hidden variables associated with datapoint $D_i$. The hidden variables can represent missing observations, or variables that cannot be directly observed. The joint distribution $p(D, Z \mid \Phi)$ is commonly called the complete-data likelihood. With hidden variables, Eqn. 8.3 can be written as

$$\Phi^* = \arg\max_\Phi \sum_{i=1}^n \log\Big(\sum_{Z_i} p(D_i, Z_i \mid \Phi)\Big). \tag{8.4}$$

Unfortunately, solving Eqn. 8.4 exactly in the general case is intractable. Fortunately, the EM algorithm is specifically designed to address the maximum-likelihood estimation problem given in Eqn. 8.4. We describe the EM algorithm in the next section.

8.2 EM algorithm

Recall that the EM algorithm, first described in [10], can be applied to maximum-likelihood estimation problems with hidden variables. When used to solve Eqn. 8.4, the EM algorithm produces a locally optimal solution for $\Phi$. Generally, the EM algorithm is an iterative algorithm that alternates between computing the posterior distribution over hidden variables given a setting of the model parameters, and computing a setting of the model parameters given a posterior distribution over hidden variables. The two alternating steps are called the Expectation-step (E-step) and the Maximization-step (M-step), respectively.

Definition 32 Let $Z$ denote the hidden variables of a probabilistic model, let $D$ denote the observed data, and let $\Phi$ denote the model parameters. The E-step of the EM algorithm computes the posterior distribution $p(Z \mid D, \Phi)$.

The M-step of the EM algorithm makes use of the expectation of the complete-data log-likelihood under the posterior distribution over hidden variables. This expectation is commonly called the Q-distribution and is defined below.

Definition 33 The Q-distribution is defined as
$$Q(\Phi', \Phi) = E_{p(Z \mid D, \Phi)}[\log p(D, Z \mid \Phi')]. \tag{8.5}$$

Definition 34 Let $\Phi^{(t)}$ be the set of model parameters at EM iteration $t$. The M-step of the EM algorithm involves solving the optimization problem
$$\Phi^{(t+1)} = \arg\max_{\Phi'} Q(\Phi', \Phi^{(t)}). \tag{8.6}$$

In the EM algorithm, the model parameters are first initialized to some starting value. Then, the algorithm alternates between performing the E-step and M-step described in Definitions 32 and 34, respectively. The algorithm is guaranteed to converge (see [10] for a proof), and the resulting solution for the model parameters $\Phi$ is taken as an approximate solution to the maximum-likelihood estimation problem. Although the EM algorithm is not guaranteed to find the globally optimal solution for $\Phi$, it is guaranteed to find a locally optimal solution.
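The alternation of Definitions 32 and 34 reduces to a short generic loop; the sketch below is a minimal skeleton in which the model-specific E-step and M-step are passed in as callables (illustrative names, fixed iteration count in place of a convergence test).

```python
def em(D, phi0, e_step, m_step, iters=50):
    """Generic EM skeleton. e_step(D, phi) returns the posterior quantities
    needed by the M-step; m_step(D, post) returns argmax over phi' of
    Q(phi', phi). Both callables are model-specific."""
    phi = phi0
    for _ in range(iters):
        post = e_step(D, phi)   # E-step: p(Z | D, phi)
        phi = m_step(D, post)   # M-step: maximize the Q-distribution
    return phi
```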
8.3 Applying EM to the PSG framework

In the PSG framework, we seek to fit the model parameters $\Phi = \{q, \epsilon, \gamma\}$ (fitting the conditional pose distributions $\gamma$ entails fitting the set of parameters $\theta$ which govern those distributions). Here, we assume that the PSG is acyclic and work with its factor graph representation, as described in Section 4.2.

In our setting, each datapoint $D_i$ corresponds to an image. Note that we differentiate between an image and a scene; a scene is a description of the image that contains higher-level information, such as what objects are present in the image and what are the compositional relationships between them. Denote by $X_i(A, \omega)$ the random variable $X(A, \omega)$ associated with scene $i$. Define $R_i(A, \omega)$ and $C_i(A, \omega)$ analogously. Define the following sets of random variables:

$$X_i = \{X_i(A, \omega) \mid A \in \Sigma,\ \omega \in \Omega_A\}$$
$$R_i = \{R_i(A, \omega) \mid A \in \Sigma,\ \omega \in \Omega_A\}$$
$$C_i = \{C_i(A, \omega) \mid A \in \Sigma,\ \omega \in \Omega_A\}$$
$$Z_i = \{X_i, R_i, C_i\}.$$

We have implicitly assumed that all scenes have the same set of bricks, which may not be true in practice. For example, scenes may be different sizes and so the pose spaces of the symbols may differ. The results in this chapter can be modified to accommodate scenes of varying sizes. For simplicity, we assume below that all scenes are the same size.

Generally, the PSG framework contains hidden variables. For example, although a FACE brick may be present in a scene, there may be no direct image evidence as to which rule was chosen to expand that FACE brick. In general, the PSG framework treats the set of random variables $Z = \{Z_1, \dots, Z_n\}$ as hidden variables. Estimating the parameters $\Phi$ can be formulated in the maximum likelihood setting with hidden variables, as in Eqn. 8.4. Below, we first outline the M-step updates assuming that the posterior distribution computed in the E-step is available. We then outline a modification to the EM algorithm's E-step whereby computation of the exact posterior distribution is replaced by computation of an approximation to the posterior.

8.3.1 M-step

In this subsection, given a set of model parameters $\Phi^{(t)}$, we show how to solve for the updated model parameters $\Phi^{(t+1)}$, as in Eqn. 8.6. We assume that the posterior quantity $p(Z \mid D, \Phi^{(t)})$ has been computed in the E-step (see the next subsection). Since we assume the $D_i$'s are independent and identically distributed, the Q-distribution can be expressed as

$$Q(\Phi', \Phi^{(t)}) = \sum_{i=1}^n E_{p(Z_i \mid D_i, \Phi^{(t)})}[\log p(Z_i \mid \Phi') + \log p(D_i \mid Z_i, \Phi')] \tag{8.7}$$

where $p(D_i \mid Z_i, \Phi')$ is the likelihood of the $i$-th datapoint and $p(Z_i \mid \Phi')$ is the prior distribution over scenes defined by the PSG factor graph. We assume that the PSG model parameters do not appear in the likelihood $p(D_i \mid Z_i, \Phi')$.

Proposition 35 Let $\mathrm{par}(X_i(A, \omega)) = 0$ denote the setting in which $C = 0$ $\forall C \in \mathrm{par}(X_i(A, \omega))$. The M-step update for the self-rooting parameters $\epsilon_A$, $A \in \Sigma$, is

$$\epsilon_A = \frac{\sum_{i=1}^n \sum_{\omega\in\Omega_A} p(X_i(A,\omega) = 1,\ \mathrm{par}(X_i(A,\omega)) = 0 \mid D_i, \Phi^{(t)})}{\sum_{i=1}^n \sum_{\omega\in\Omega_A} \sum_{x\in\{0,1\}} p(X_i(A,\omega) = x,\ \mathrm{par}(X_i(A,\omega)) = 0 \mid D_i, \Phi^{(t)})}. \tag{8.8}$$

Proposition 36 Let $q_{(A,r)}$ denote the probability of choosing rule $r \in R_A$.
The M-step update for the rule selection probability $q_{(A,r)}$, $A \in \Sigma$, $r \in R_A$, is

$$q_{(A,r)} = \frac{\sum_{i=1}^n \sum_{\omega\in\Omega_A} p(X_i(A,\omega) = 1,\ R_i(A,\omega) = I(E_r) \mid D_i, \Phi^{(t)})}{\sum_{i=1}^n \sum_{\omega\in\Omega_A} \sum_{r'\in R_A} p(X_i(A,\omega) = 1,\ R_i(A,\omega) = I(E_{r'}) \mid D_i, \Phi^{(t)})} \tag{8.9}$$

where $E$ denotes a set of binary random variables indexed by $R_A$, and $I(E_r)$ denotes the setting $E_r = 1$, $E_{r'} = 0$ $\forall r' \ne r$ (i.e., $I(E_r)$ indicates that rule $r$ was selected).

To prove the propositions above, we require the following lemma.

Lemma 37 In the PSG framework, the Q-distribution can be expressed in the form

$$\begin{aligned} Q(\Phi', \Phi^{(t)}) = {} & \sum_{i=1}^n E_{p(Z_i \mid D_i, \Phi^{(t)})}\big[\log p(D_i \mid Z_i, \Phi')\big] \\ & + \sum_{i=1}^n \sum_{(A,\omega)\in\mathcal{B}} E_{p(X_i(A,\omega),\, \mathrm{par}(X_i(A,\omega)) \mid D_i, \Phi^{(t)})}\big[\log \Psi^L_{\epsilon_A}(x_{\mathrm{par}(X_i(A,\omega))}, x_{X_i(A,\omega)})\big] \\ & + \sum_{i=1}^n \sum_{(A,\omega)\in\mathcal{B}} E_{p(X_i(A,\omega),\, R_i(A,\omega) \mid D_i, \Phi^{(t)})}\big[\log \Psi^S_{q_A}(x_{X_i(A,\omega)}, x_{R_i(A,\omega)})\big] \\ & + \sum_{i=1}^n \sum_{(A,\omega)\in\mathcal{B}} E_{p(R_i(A,\omega),\, C_i(A,\omega) \mid D_i, \Phi^{(t)})}\Big[\sum_{\substack{r\in R_A \\ 1\le j\le n_r}} \log p(C_i(A,\omega,r,j) \mid R_i(A,\omega,r), \Phi')\Big]. \end{aligned} \tag{8.10}$$

Proof of Lemma 37 Recall from Eqn. 4.4 that we express the prior distribution over scenes in a factorized form. Substituting Eqn. 4.4 into Eqn. 8.7 yields

$$\begin{aligned} Q(\Phi', \Phi^{(t)}) = {} & \sum_{i=1}^n E_{p(Z_i \mid D_i, \Phi^{(t)})}\big[\log p(D_i \mid Z_i, \Phi')\big] \\ & + \sum_{i=1}^n E_{p(Z_i \mid D_i, \Phi^{(t)})}\Big[\sum_{(A,\omega)\in\mathcal{B}} \log p(X_i(A,\omega) \mid \mathrm{par}(X_i(A,\omega)), \Phi')\Big] \\ & + \sum_{i=1}^n E_{p(Z_i \mid D_i, \Phi^{(t)})}\Big[\sum_{(A,\omega)\in\mathcal{B}} \log p(R_i(A,\omega) \mid X_i(A,\omega), \Phi')\Big] \\ & + \sum_{i=1}^n E_{p(Z_i \mid D_i, \Phi^{(t)})}\Big[\sum_{(A,\omega)\in\mathcal{B}} \sum_{\substack{r\in R_A \\ 1\le j\le n_r}} \log p(C_i(A,\omega,r,j) \mid R_i(A,\omega,r), \Phi')\Big]. \end{aligned} \tag{8.11}$$

Simplifying, each term inside an expectation depends only on a subset of $Z_i$, so the expectations in the second, third, and fourth terms can be taken with respect to $p(X_i, C_i \mid D_i, \Phi^{(t)})$, $p(X_i, R_i \mid D_i, \Phi^{(t)})$, and $p(R_i, C_i \mid D_i, \Phi^{(t)})$, respectively (Eqn. 8.12). The conditional terms $p(X_i(A,\omega) \mid \mathrm{par}(X_i(A,\omega)), \Phi')$ and $p(R_i(A,\omega) \mid X_i(A,\omega), \Phi')$ can be expressed in terms of a Leaky-OR potential (Definition 12) and a Selection potential (Definition 13), respectively, using a subset of the model parameters; substituting the forms of the potential functions yields Eqn. 8.13. Finally, using the linearity of expectations to bring each expectation inside the sum over bricks, and restricting each expectation to the variables appearing in its term, yields the form stated in Eqn. 8.10.

Proof of Proposition 35 The M-step involves optimizing the Q-distribution with respect to the model parameters. Consider setting the self-rooting parameters $\epsilon_A$, $A \in \Sigma$, to optimize the factorized Q-distribution given in Eqn. 8.10. From the definition of the Leaky-OR potential in Definition 12, the self-rooting parameter is used only when all the input values are zero. Hence, in fitting $\epsilon_A$, we only need to consider the case $\mathrm{par}(X_i(A,\omega)) = 0$, $1 \le i \le n$, $\omega \in \Omega_A$. Substituting the form of the Leaky-OR potential for the case $\mathrm{par}(X_i(A,\omega)) = 0$ into Eqn. 8.10 and taking the partial derivative with respect to $\epsilon_A$,
$$\frac{\partial Q(\Phi', \Phi^{(t)})}{\partial \epsilon_A} = \sum_{i=1}^n \sum_{\omega\in\Omega_A} \sum_{x\in\{0,1\}} p(X_i(A,\omega) = x,\ \mathrm{par}(X_i(A,\omega)) = 0 \mid D_i, \Phi^{(t)}) \Big(\frac{x}{\epsilon_A} - \frac{1-x}{1-\epsilon_A}\Big) \tag{8.15}$$

where we have written the form of the expectation under the posterior explicitly. Setting the derivative to zero and solving for $\epsilon_A$ yields

$$\epsilon_A = \frac{\sum_{i=1}^n \sum_{\omega\in\Omega_A} p(X_i(A,\omega) = 1,\ \mathrm{par}(X_i(A,\omega)) = 0 \mid D_i, \Phi^{(t)})}{\sum_{i=1}^n \sum_{\omega\in\Omega_A} \sum_{x\in\{0,1\}} p(X_i(A,\omega) = x,\ \mathrm{par}(X_i(A,\omega)) = 0 \mid D_i, \Phi^{(t)})}. \tag{8.16}$$

Proof of Proposition 36 Consider setting the parameter $q_A$ to optimize the factorized Q-distribution given in Eqn. 8.10. From the definition of the Selection potential in Definition 13, the selection probabilities are used only when the binary input value is 1. Hence, in fitting $q_A$, we need only consider the case $X_i(A,\omega) = 1$, $1 \le i \le n$, $\omega \in \Omega_A$. Recall that $q_A$ represents rule selection probabilities, so we have the constraint $\sum_{r\in R_A} q_{(A,r)} = 1$. Hence, optimizing the Q-distribution with respect to $q_A$ is a constrained optimization problem. Recall that the method of Lagrange multipliers allows one to find local maxima/minima of a function subject to equality constraints. Using the method of Lagrange multipliers, we seek to maximize the Lagrange function

$$L(\Phi', \Phi^{(t)}) = Q(\Phi', \Phi^{(t)}) - \lambda \Big(\sum_{r\in R_A} q_{(A,r)} - 1\Big). \tag{8.17}$$

Taking the partial derivative of Eqn. 8.17 with respect to $q_{(A,r)}$,

$$\frac{\partial L(\Phi', \Phi^{(t)})}{\partial q_{(A,r)}} = \frac{\partial Q(\Phi', \Phi^{(t)})}{\partial q_{(A,r)}} - \lambda. \tag{8.18}$$

Substituting the form of the Selection potential for the case $X_i(A,\omega) = 1$ into Eqn. 8.10, and taking the partial derivative with respect to $q_{(A,r)}$,

$$\frac{\partial Q(\Phi', \Phi^{(t)})}{\partial q_{(A,r)}} = \sum_{i=1}^n \sum_{\omega\in\Omega_A} p(X_i(A,\omega) = 1,\ R_i(A,\omega) = I(E_r) \mid D_i, \Phi^{(t)})\, \frac{1}{q_{(A,r)}}. \tag{8.19}$$

Now, plugging Eqn. 8.19 into Eqn. 8.18, setting $\frac{\partial L(\Phi', \Phi^{(t)})}{\partial q_{(A,r)}} = 0$, and solving for $q_{(A,r)}$ yields the solution

$$q_{(A,r)} = \frac{\sum_{i=1}^n \sum_{\omega\in\Omega_A} p(X_i(A,\omega) = 1,\ R_i(A,\omega) = I(E_r) \mid D_i, \Phi^{(t)})}{\sum_{i=1}^n \sum_{\omega\in\Omega_A} \sum_{r'\in R_A} p(X_i(A,\omega) = 1,\ R_i(A,\omega) = I(E_{r'}) \mid D_i, \Phi^{(t)})}. \tag{8.20}$$

We now detail how to fit the conditional pose distributions $\gamma$, which can represent either a Categorical distribution or an IndBern distribution. In both cases, the distributions $\gamma$ are governed by parameters $\theta$. Below, we assume that the operation of subtraction is defined on the pose spaces $\Omega_A \in \Omega$. For example, a pose $\omega \in \Omega_A$ for a brick could represent a vector. We use the notation

$$K_{(r,j)} = \{\Delta \mid \Delta = \omega - z,\ \omega \in \Omega_{A_{(r,0)}},\ z \in \Omega_{A_{(r,j)}}\}.$$

$K_{(r,j)}$ represents the set of possible differences between the pose of a brick of type $A_{(r,0)}$ and the pose of a brick of type $A_{(r,j)}$, for a rule $r$ and the $j$-th symbol in the RHS of the rule.

In practice, it may be helpful to tie together parameters of the model to represent invariances in the model. One such invariance, often useful in computer vision, is shift-invariance. For example, consider selecting the location for the mouth of a face; the distribution over the location of the mouth may be most naturally expressed relative to the centre of the face. Below, we derive updates for $\theta$ when the conditional pose distributions they parameterize are shift-invariant.

We can reparameterize the parameters $\theta$ to represent shift-invariance. Consider the parameters $\theta_{(\omega,r,j)}$, $r \in R$, $1 \le j \le n_r$, $\omega \in \Omega_{A_{(r,0)}}$. We tie together the set of parameters $\{\theta_{(\omega,r,j)} \mid \omega \in \Omega_{A_{(r,0)}}\}$ so that $\theta_{(\omega,r,j,z)} = \theta_{(\omega',r,j,z')}$ whenever $(\omega - z) = (\omega' - z')$. Now, associate with each $\Delta \in K_{(r,j)}$ a parameter $\hat\theta_{(\Delta,r,j)}$. The parameters $\theta_{(\omega,r,j)}$ can be written in terms of $\hat\theta_{(\Delta,r,j)}$. Concretely, $\theta_{(\omega,r,j,z)} = \hat\theta_{(\Delta,r,j)}$ where $\Delta = \omega - z$. Note that a setting for the parameters $\hat\theta$ implies a setting for the parameters $\theta$.
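To illustrate the tying, the sketch below expands a table of shared offset parameters $\hat\theta$ into the per-pose parameters $\theta_{(\omega,r,j)}$ for one parent pose $\omega$ (the rule and child indices are suppressed; the names are illustrative, and the final normalization is an assumption that matters only when boundary effects truncate the Categorical's support).

```python
import numpy as np

def expand_tied_params(theta_hat, omega, support):
    """theta_hat: dict mapping an offset Delta (a tuple) to its shared
    parameter, for fixed rule r and child j; support: the poses z in
    Gamma_(omega,r,j). Returns theta_(omega,r,j) as an array over support."""
    theta = np.array([theta_hat[tuple(np.subtract(omega, z))] for z in support])
    return theta / theta.sum()  # renormalize over the available support
```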
Proposition 38 Suppose the set of conditional pose distributions $\{\gamma_{(\omega,r,j)} \mid \omega \in \Omega_{A_{(r,0)}}\}$, $r \in R$, $1 \le j \le n_r$, is a set of shift-invariant Categorical distributions. This implies that the conditional pose distributions $p(C_i(A,\omega,r,j) \mid R_i(A,\omega,r), D_i, \Phi^{(t)})$, $\omega \in \Omega_{A_{(r,0)}}$, $r \in R$, $1 \le j \le n_r$, are represented by Selection potentials. The M-step update for the parameter $\hat\theta_{(\Delta,r,j)}$ is given by

$$\hat\theta_{(\Delta,r,j)} = \frac{\sum_{i=1}^n \sum_{\omega\in\Omega_A} \sum_{z\in\Gamma_{(\omega,r,j)}:\, \omega - z = \Delta} p(C_i(A,\omega,r,j) = I(E_z) \mid R_i(A,\omega,r) = 1, D_i, \Phi^{(t)})}{\sum_{\Delta'\in K_{(r,j)}} \sum_{i=1}^n \sum_{\omega\in\Omega_A} \sum_{z\in\Gamma_{(\omega,r,j)}:\, \omega - z = \Delta'} p(C_i(A,\omega,r,j) = I(E_z) \mid R_i(A,\omega,r) = 1, D_i, \Phi^{(t)})} \tag{8.21}$$

where $E$ denotes a set of binary random variables indexed by $\Omega_{A_{(r,j)}}$, and $I(E_z)$ denotes the setting $E_z = 1$, $E_{z'} = 0$ $\forall z' \ne z$ (i.e., $I(E_z)$ indicates that pose $z$ was selected).

Proposition 39 Suppose the set of conditional pose distributions $\{\gamma_{(\omega,r,j)} \mid \omega \in \Omega_{A_{(r,0)}}\}$, $r \in R$, $1 \le j \le n_r$, is a set of shift-invariant IndBern distributions. This implies that the conditional pose distributions $p(C_i(A,\omega,r,j) \mid R_i(A,\omega,r), D_i, \Phi^{(t)})$, $\omega \in \Omega_{A_{(r,0)}}$, $r \in R$, $1 \le j \le n_r$, are represented by Berns potentials. The M-step update for the parameter $\hat\theta_{(\Delta,r,j)}$ is given by

$$\hat\theta_{(\Delta,r,j)} = \frac{\sum_{i=1}^n \sum_{\omega\in\Omega_A} \sum_{z\in\Gamma_{(\omega,r,j)}:\, \omega - z = \Delta} p(C_i(A,\omega,r,j,z) = 1 \mid R_i(A,\omega,r) = 1, D_i, \Phi^{(t)})}{\sum_{i=1}^n \sum_{\omega\in\Omega_A} \sum_{z\in\Gamma_{(\omega,r,j)}:\, \omega - z = \Delta} 1}. \tag{8.22}$$

Proof of Proposition 38 From the definition of the Selection potential in Definition 13, the parameters $\theta_{(\omega,r,j)}$ (and so $\hat\theta_{(\Delta,r,j)}$) are used only when the binary input value is 1. Hence, we need only consider cases when $R_i(A,\omega,r) = 1$, $1 \le i \le n$. Here, the set of parameters $\{\hat\theta_{(\Delta,r,j)} \mid \Delta \in K_{(r,j)}\}$ represents the selection probabilities of a Categorical distribution, so we have the constraint $\sum_{\Delta\in K_{(r,j)}} \hat\theta_{(\Delta,r,j)} = 1$. Hence, optimizing the Q-distribution with respect to $\hat\theta_{(\Delta,r,j)}$ can be expressed as a constrained optimization problem. As when updating the parameters $q$, we employ the method of Lagrange multipliers to maximize the Lagrange function

$$L(\Phi', \Phi^{(t)}) = Q(\Phi', \Phi^{(t)}) - \lambda \Big(\sum_{\Delta\in K_{(r,j)}} \hat\theta_{(\Delta,r,j)} - 1\Big). \tag{8.23}$$

Taking the partial derivative of Eqn. 8.23 with respect to $\hat\theta_{(\Delta,r,j)}$,

$$\frac{\partial L(\Phi', \Phi^{(t)})}{\partial \hat\theta_{(\Delta,r,j)}} = \frac{\partial Q(\Phi', \Phi^{(t)})}{\partial \hat\theta_{(\Delta,r,j)}} - \lambda. \tag{8.24}$$

Substituting the form of the Selection potential for the case $R_i(A,\omega,r) = 1$ into Eqn. 8.10, and taking the partial derivative with respect to $\hat\theta_{(\Delta,r,j)}$,

$$\frac{\partial Q(\Phi', \Phi^{(t)})}{\partial \hat\theta_{(\Delta,r,j)}} = \frac{\sum_{i=1}^n \sum_{\omega\in\Omega_A} \sum_{z\in\Gamma_{(\omega,r,j)}:\, \omega - z = \Delta} p(C_i(A,\omega,r,j) = I(E_z) \mid R_i(A,\omega,r) = 1, D_i, \Phi^{(t)})}{\hat\theta_{(\Delta,r,j)}} \tag{8.25}$$

where we have written the form of the expectation under the posterior explicitly. Now, plugging Eqn. 8.25 into Eqn. 8.24, setting $\frac{\partial L(\Phi', \Phi^{(t)})}{\partial \hat\theta_{(\Delta,r,j)}} = 0$, and solving for $\hat\theta_{(\Delta,r,j)}$ yields the solution stated in Eqn. 8.21.

Proof of Proposition 39 From the definition of the Berns potential in Definition 14, the parameters $\theta_{(\omega,r,j)}$ (and so $\hat\theta_{(\Delta,r,j)}$) are only used when the binary input value is 1. Hence, we need only consider cases when $R_i(A,\omega,r) = 1$, $1 \le i \le n$. Unlike in Proposition 38, the parameters here have no constraint, so we can directly optimize the Q-distribution with respect to $\hat\theta_{(\Delta,r,j)}$. Substituting the form of the Berns potential for the case $R_i(A,\omega,r) = 1$ into Eqn. 8.10 and taking the partial derivative with respect to $\hat\theta_{(\Delta,r,j)}$,
Proof of Proposition 38

From the definition of the Selection potential in Definition 13, the parameters θ_(ω,r,j) (and so θ̂_(Δ,r,j)) are used only when the binary input value is 1. Hence, we need only consider cases where R_i(A, ω, r) = 1, 1 ≤ i ≤ n. Here, the set of parameters {θ̂_(Δ,r,j) | Δ ∈ K_(r,j)} represents the selection probabilities of a Categorical distribution, so we have the constraint Σ_{Δ∈K_(r,j)} θ̂_(Δ,r,j) = 1. Hence, optimizing the Q-distribution with respect to θ̂_(Δ,r,j) can be expressed as a constrained optimization problem. As when updating the parameters q, we employ the method of Lagrange multipliers to maximize the Lagrange function

\[
L(\Phi', \Phi^{(t)}) = Q(\Phi', \Phi^{(t)}) - \lambda \Big( \sum_{\Delta \in K_{(r,j)}} \hat\theta_{(\Delta,r,j)} - 1 \Big). \tag{8.23}
\]

Taking the partial derivative of Eqn. 8.23 with respect to θ̂_(Δ,r,j),

\[
\frac{\partial L(\Phi', \Phi^{(t)})}{\partial \hat\theta_{(\Delta,r,j)}} = \frac{\partial Q(\Phi', \Phi^{(t)})}{\partial \hat\theta_{(\Delta,r,j)}} - \lambda. \tag{8.24}
\]

Substituting the form of the Selection potential for the case R_i(A, ω, r) = 1 into Eqn. 8.10, and taking the partial derivative with respect to θ̂_(Δ,r,j),

\[
\frac{\partial Q(\Phi', \Phi^{(t)})}{\partial \hat\theta_{(\Delta,r,j)}} = \sum_{i=1}^{n} \sum_{\omega \in \Omega_A} \sum_{\substack{z \in \Gamma_{(\omega,r,j)}:\\ \omega - z = \Delta}} \frac{p(C_i(A,\omega,r,j) = I(E_z) \mid R_i(A,\omega,r) = 1, D_i, \Phi^{(t)})}{\hat\theta_{(\Delta,r,j)}} \tag{8.25}
\]

where we have written the form of the expectation under the posterior explicitly. Now, plugging Eqn. 8.25 into Eqn. 8.24, setting ∂L(Φ', Φ^(t))/∂θ̂_(Δ,r,j) = 0, and solving for θ̂_(Δ,r,j) yields the solution

\[
\hat\theta_{(\Delta,r,j)} = \frac{\displaystyle\sum_{i=1}^{n} \sum_{\omega \in \Omega_A} \sum_{\substack{z \in \Gamma_{(\omega,r,j)}:\\ \omega - z = \Delta}} p(C_i(A,\omega,r,j) = I(E_z) \mid R_i(A,\omega,r) = 1, D_i, \Phi^{(t)})}{\displaystyle\sum_{\Delta' \in K_{(r,j)}} \sum_{i=1}^{n} \sum_{\omega \in \Omega_A} \sum_{\substack{z \in \Gamma_{(\omega,r,j)}:\\ \omega - z = \Delta'}} p(C_i(A,\omega,r,j) = I(E_z) \mid R_i(A,\omega,r) = 1, D_i, \Phi^{(t)})}. \tag{8.26}
\]

Proof of Proposition 39

From the definition of the IndBern potential in Definition 14, the parameters θ_(ω,r,j) (and so θ̂_(Δ,r,j)) are used only when the binary input value is 1. Hence, we need only consider cases where R_i(A, ω, r) = 1, 1 ≤ i ≤ n. Unlike in Proposition 38, the parameters here do not have a constraint. Hence, we can directly optimize the Q-distribution with respect to θ̂_(Δ,r,j). Substituting the form of the Berns potential for the case R_i(A, ω, r) = 1 into Eqn. 8.10 and taking the partial derivative with respect to θ̂_(Δ,r,j),

\[
\frac{\partial Q(\Phi', \Phi^{(t)})}{\partial \hat\theta_{(\Delta,r,j)}} = \sum_{i=1}^{n} \sum_{\omega \in \Omega_A} \sum_{\substack{z \in \Gamma_{(\omega,r,j)}:\\ \omega - z = \Delta}} \sum_{c' \in \{0,1\}} p(C_i(A,\omega,r,j,z) = c' \mid R_i(A,\omega,r) = 1, D_i, \Phi^{(t)}) \left[ \frac{c'}{\hat\theta_{(\Delta,r,j)}} - \frac{1-c'}{1-\hat\theta_{(\Delta,r,j)}} \right] \tag{8.27}
\]

where we have written the form of the expectation under the posterior explicitly. Setting the derivative to zero and solving for θ̂_(Δ,r,j) yields

\[
\hat\theta_{(\Delta,r,j)} = \frac{\displaystyle\sum_{i=1}^{n} \sum_{\omega \in \Omega_A} \sum_{\substack{z \in \Gamma_{(\omega,r,j)}:\\ \omega - z = \Delta}} p(C_i(A,\omega,r,j,z) = 1 \mid R_i(A,\omega,r) = 1, D_i, \Phi^{(t)})}{\displaystyle\sum_{i=1}^{n} \sum_{\omega \in \Omega_A} \sum_{\substack{z \in \Gamma_{(\omega,r,j)}:\\ \omega - z = \Delta}} 1}. \tag{8.28}
\]

8.3.2 Approximate E-step

In this subsection, we describe an approximation scheme to compute the posterior quantities necessary for the M-step updates of the model parameters Φ. Recall that in addition to computing approximate marginals over a single random variable, the results of LBP can be used to compute approximate joint marginals over a set of random variables.

Definition 40 Let f be a factor node with neighbours U = N(f) and potential Ψ_f. Let x_U denote an outcome for the variables in U. LBP approximates the joint marginal distribution p(x_U) by

\[
\hat p(x_U) \propto \Psi_f(x_U) \prod_{u \in U} \mu_{u \to f}(x_u) \tag{8.29}
\]

where the distribution is normalized to sum to one over all possible settings of x_U.

When the factor graph contains no loops, p̂(x_U) matches the true joint marginal. However, the factor graphs considered in the PSG framework generally contain loops, and so p̂(x_U) is generally only an approximation to the true joint marginal. Nevertheless, in practice, p̂(x_U) serves as a useful approximation. The main result of this subsection is stated below.

Proposition 41 Consider the posterior quantities necessary for the M-step updates of Φ. All such posterior quantities can be approximated using the messages of LBP and Eqn. 8.29.

Proof of Proposition 41

The posterior quantities we seek to approximate are:

• p(X_i(A, ω), par(X_i(A, ω)) = 0 | D_i, Φ^(t)) for (A, ω) ∈ B, 1 ≤ i ≤ n,
• p(X_i(A, ω) = 1, R_i(A, ω) = I(E_r) | D_i, Φ^(t)) for (A, ω) ∈ B, r ∈ R_A, 1 ≤ i ≤ n,
• p(C_i(A, ω, r, j) | R_i(A, ω, r) = 1, D_i, Φ^(t)) for (A, ω) ∈ B, r ∈ R_A, 1 ≤ j ≤ n_r, 1 ≤ i ≤ n.

Let U be a set of random variables that participates in a posterior quantity we seek to approximate. To prove the proposition, it is sufficient to show that for each posterior quantity of interest there exists a factor f with U ⊆ N(f). Recall the factor graph representation from Figure 4.2. For the posterior quantities of the form p(X_i(A, ω), par(X_i(A, ω)) = 0 | D_i, Φ^(t)), the factors f¹_(A,ω) contain the random variables X_i(A, ω) and par(X_i(A, ω)) as neighbours. For the posterior quantities of the form p(X_i(A, ω) = 1, R_i(A, ω) = I(E_r) | D_i, Φ^(t)), the factors f²_(A,ω) contain the random variables X_i(A, ω) and R_i(A, ω) as neighbours. For the posterior quantities of the form p(C_i(A, ω, r, j) | R_i(A, ω, r) = 1, D_i, Φ^(t)), the factors f³_(A,ω,r,j) contain the random variables C_i(A, ω, r, j) and R_i(A, ω, r) as neighbours. To perform an approximate E-step in the PSG framework, we run LBP to convergence and use Eqn. 8.29 to approximate each posterior quantity of interest.
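The following sketch illustrates Eqn. 8.29 for a single factor. The dictionary-based layout of potentials and messages is an illustrative assumption; any indexing scheme that enumerates the joint outcomes of a factor's neighbours would do.

    def approx_joint_marginal(psi, messages):
        """Approximate joint marginal over a factor's neighbours (Eqn. 8.29).

        psi : dict mapping an outcome tuple x_U -> factor potential value.
        messages : list of dicts, one per neighbour u, mapping a value x_u
            -> incoming LBP message mu_{u->f}(x_u).
        """
        p_hat = {}
        for x_U, val in psi.items():
            for u, x_u in enumerate(x_U):
                val *= messages[u][x_u]   # multiply in each variable-to-factor message
            p_hat[x_U] = val
        Z = sum(p_hat.values())
        return {x: v / Z for x, v in p_hat.items()}   # normalize over settings of x_U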
8.4 Effectiveness of approximate EM learning

Unlike the standard EM algorithm, the approximate EM algorithm we describe in this chapter is not guaranteed to increase a lower bound on the marginal likelihood of the observed data. Note also that the approximate EM algorithm has two sources of approximation. First, recall that the E-step in the EM algorithm described in Section 8.2 requires computation of exact posterior quantities; here, we use an E-step that computes approximate posterior quantities by running LBP. Second, recall that the factor graph construction described in Definition 16 leads to an exact representation of the distribution over scenes induced by a PSG only when the grammar is acyclic. When the grammar is cyclic, the factor graph construction leads to an approximate representation of the PSG, and so any learning algorithm defined using the factor graph construction is fitting a potentially different (but related) model. Despite these sources of approximation, in practice the approximate EM algorithm outlined here is effective in learning model parameters.

To demonstrate the effectiveness of the approximate EM algorithm described in this chapter, we show the performance of several PSG models on two scene understanding tasks as a function of approximate EM iteration. In Figure 8.1, we show performance in terms of area under a precision-recall curve (AUC) for several PSG models on the tasks of contour detection and image segmentation (see Chapter 9 for a full description of the models and tasks). Note that higher AUC indicates superior performance. As shown in the figure, performance tends to improve with subsequent approximate EM iterations.

Figure 8.1: Area under the precision-recall curve (AUC) as a function of approximate EM iteration. Higher AUC indicates better performance. Each line in the plot corresponds to a PSG model for a particular scene understanding task indicated in brackets. Overall, the approximate EM algorithm described in this chapter seems to improve performance. See Chapter 9 for a full description of the models and tasks.

Although the performance decreases slightly for one of the models/tasks, the approximate EM algorithm generally seems to improve performance.

Chapter 9
Experiments

To demonstrate the generality of the PSG framework, we show experimental results on three different scene understanding tasks: contour detection, face localization, and binary image segmentation. As discussed in Chapter 1, previous approaches for these tasks have typically employed fairly distinct methods. Here, we demonstrate that the PSG framework can address all three problems. In particular, we describe PSGs for each scene understanding task in the language of Definition 1 and demonstrate that LBP can be used as the inference engine for these tasks. We use partially-supervised learning¹ to fit model parameters for contour detection and binary image segmentation, and supervised learning to fit model parameters for face localization. We report the speed of inference as performed on a laptop with an Intel(R) i7 2.5GHz CPU and 16 GB of RAM. Our framework is implemented in Matlab/C using a single thread.

All experiments were performed using a common and general implementation of the PSG framework. To handle the different tasks in this general implementation, one simply expresses an appropriate PSG in a high-level “language” like the one used in Chapter 3 and designs an appropriate data model. The implementation automatically constructs the factor graph, and performs parameter estimation (learning) and inference.

¹ So-called because the supervision labels specify only the presence/absence of a subset of bricks.

9.1 Contour detection

To study contour detection, we use the Berkeley Segmentation Dataset (BSDS500) described in [1], following the experimental setup described in [16].
The dataset contains natural images and object boundaries manually marked by human annotators. For our experiments, we used the standard split of the dataset with 200 training images and 200 test images. For each image we use the boundaries marked by a single human annotator to define ground-truth binary contour maps. From a binary contour map B we generate a noisy real-valued image D by sampling each pixel D(i, j) independently from a Normal distribution whose mean depends on the value of B(i, j). Formally,

\[
D(i,j) \sim \mathcal{N}(\mu_{B(i,j)}, \sigma). \tag{9.1}
\]

For the experiments, we used μ₀ = 150, μ₁ = 100, and σ = 40.

9.1.1 The PSG contour model

The contour model we use in the experiments below is similar to the model described in Grammar 1, but with the model parameters {ε, q} learned in a partially-supervised approach. For learning, we treat the ground-truth contour maps B as observations for the grammar symbol INK, and use the approximate EM algorithm described in Chapter 8 to fit parameters. Note that we do not have fully observed data, since 1) we do not have observations for the states of the CURVE bricks, and 2) we do not have observations for the rule choices made by the bricks present in the scene.

Recall that for approximate EM learning, we use LBP to compute approximations to the posterior quantities of interest during the E-step. To speed up convergence of LBP, we use warm-starts between EM iterations; i.e., the LBP messages for the E-step of EM iteration t + 1 are initialized to the converged LBP messages from the E-step of EM iteration t.

Grammar 6 shows the learned contour model. We will refer to this contour model as the “PSG contour model”.

Grammar 6 PSG contour model: a grammar for contour detection learned in a partially-supervised setting. The function T_θ denotes a rotation in the plane by an angle θ and Round maps a point in the plane to the nearest grid point.

Σ = {CURVE, INK}. Ω_CURVE = [N] × [M] × [8]. Ω_INK = [N] × [M].
Rules:
0.647, (CURVE, (x, y, θ)) → (INK, δ((x, y))), (CURVE, δ(((x, y) + Round(T_θ(1, 0)), θ)))
0.147, (CURVE, (x, y, θ)) → (INK, δ((x, y))), (CURVE, δ(((x, y) + Round(T_θ(1, −1)), θ)))
0.152, (CURVE, (x, y, θ)) → (INK, δ((x, y))), (CURVE, δ(((x, y) + Round(T_θ(1, +1)), θ)))
0.019, (CURVE, (x, y, θ)) → (CURVE, δ((x, y, θ − 1)))
0.019, (CURVE, (x, y, θ)) → (CURVE, δ((x, y, θ + 1)))
0.012, (CURVE, (x, y, θ)) → (INK, δ((x, y)))
1.00, (INK, (x, y)) → ∅
ε_CURVE = 4.28 × 10⁻⁵, ε_INK = 1.87 × 10⁻¹².

Recall that for inference, we convert a PSG to a factor graph and run LBP. To incorporate the data model given in Eqn. 9.1 into the factor graph representation, we attach unary potentials to the set of variable nodes {X(INK, (i, j)) | (i, j) ∈ Ω_INK}. For a variable node X(INK, (i, j)) we attach a unary potential f⁰_(INK,(i,j))(X(INK, (i, j)) = x, D) = N(D(i, j); μ_x, σ). In this case, the resulting factor graph represents the conditional distribution p(S | D).

9.1.2 Qualitative contour detection results

In Figure 9.1 we show contour detection results on examples from the BSDS500 test set. We show the approximate marginal probability that each pixel is part of a curve as computed by LBP, p̂(X(INK, (i, j)) = 1 | D). Running LBP to convergence on a 481 × 321 test image took on average 1.5 hours. As shown in Figure 9.1, despite the PSG contour model’s simplicity, it is able to perform reasonably well in detecting contours in noisy images.
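As a concrete illustration of the data model in Eqn. 9.1 and the unary potentials just described, here is a minimal NumPy sketch; the function names are ours, not from the accompanying implementation.

    import numpy as np

    def sample_noisy_image(B, mu0=150.0, mu1=100.0, sigma=40.0, rng=None):
        """Sample a noisy observation D from a binary contour map B (Eqn. 9.1):
        each pixel is drawn independently from N(mu_{B(i,j)}, sigma)."""
        rng = np.random.default_rng() if rng is None else rng
        means = np.where(B == 1, mu1, mu0)
        return rng.normal(means, sigma)

    def ink_unary_potential(D, x, mu0=150.0, mu1=100.0, sigma=40.0):
        """Unary potential f0_{(INK,(i,j))}(X(INK,(i,j)) = x, D) = N(D(i,j); mu_x, sigma),
        evaluated for all pixels at once."""
        mu = mu1 if x == 1 else mu0
        return np.exp(-0.5 * ((D - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))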
Note that the model sometimes has trouble localizing curved contours. Since the model is similar to a first-order Markov model, it is unable to faithfully model and capture higher-order contour statistics, such as curvature, which hampers its ability to detect contours that do not have low curvature. This suggests that modelling curvature is important for contour detection. As such, we believe more realistic models of contours would be able to capture richer curvature information and outperform the simple contour model described in Grammar 6. Nevertheless, the qualitative contour detection results demonstrate the feasibility of the PSG framework’s approach on this task. In the next section, we provide a quantitative analysis of the performance of the PSG contour model. Figure 9.2 shows more contour detection results on images from the BSDS500 at a larger resolution so the reader can examine more details.

Figure 9.1: Contour detection results on four examples from the BSDS500 test set. Top row: ground-truth contour maps B. Middle row: noisy observations D. Bottom row: visualization of the approximate marginal probabilities of INK bricks present in the scene computed by LBP. Each pixel represents an INK brick at that location. The gray-scale values show the approximate marginal probabilities p̂(X(INK, (i, j)) = 1 | D) at each pixel (i, j) computed by LBP. Darker pixels indicate a higher marginal probability. Despite the PSG contour model’s simplicity, the model is able to perform reasonably in detecting contours in noisy images. However, the model has trouble localizing curved contours; this is particularly evident in the left and rightmost examples.

9.1.3 Quantitative contour detection results

In this subsection we perform a quantitative comparison between the PSG contour model and baseline models. Our chief comparison is to the Field-of-Patterns (FOP) models of [16], which are specifically designed to recover binary images. To demonstrate the importance of context for contour detection, we also compare against a PSG where all of the compositional rules are of the form A → ∅; i.e., all bricks are independent. We will refer to this PSG as the “No-Context PSG” since such a PSG relies solely on the data model for contour detection. For the PSG contour model and the No-Context PSG, we compute the area under the precision-recall curve (AUC) by thresholding p̂(X(INK, (i, j)) = 1 | D), (i, j) ∈ Ω_INK, over a range of values. The authors of [16] have provided us with their experimental data, which we use to make comparisons. Table 9.1 compares the AUC of the PSG contour model to baselines. Figure 9.3 compares the precision-recall curves of the PSG contour model and baseline methods.

Figure 9.2: Contour detection results for three examples from the BSDS500 test set. Top row: ground-truth contour maps B. Middle row: noisy observations D. Bottom row: visualization of the approximate marginal probabilities of INK bricks present in the scene computed by LBP. Each pixel represents an INK brick at that location. The gray-scale values show the approximate marginal probabilities p̂(X(INK, (i, j)) = 1 | D) at each pixel (i, j) computed by LBP. Darker pixels indicate a higher marginal probability.
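The AUC evaluation thresholds the approximate marginals over a range of values and integrates the resulting precision-recall curve. A minimal sketch of this style of evaluation; the threshold grid and the trapezoidal integration below are illustrative choices, not a description of the exact evaluation code used in the experiments.

    import numpy as np

    def precision_recall_auc(p_hat, B, num_thresholds=100):
        """Threshold the approximate marginals p_hat = p̂(X(INK,(i,j)) = 1 | D)
        against the ground-truth binary map B and integrate precision over recall."""
        precisions, recalls = [], []
        for t in np.linspace(0.0, 1.0, num_thresholds):
            pred = p_hat >= t
            tp = np.logical_and(pred, B == 1).sum()
            precisions.append(tp / max(pred.sum(), 1))
            recalls.append(tp / max((B == 1).sum(), 1))
        order = np.argsort(recalls)   # integrate in order of increasing recall
        return np.trapz(np.array(precisions)[order], np.array(recalls)[order])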
Model                AUC
No-Context PSG       0.12
PSG contour model    0.75
1-level FOP [16]     0.73
4-level FOP [16]     0.78

Table 9.1: AUC for the No-Context PSG, the PSG contour model, and the 1-level and 4-level Field-of-Patterns (FOP) models from [16]. Note that the PSG contour model is competitive with the 1-level FOP model of [16], which is specifically designed to recover binary images from noisy images. While the 4-level FOP model outperforms the PSG contour model, the PSG contour model is still competitive despite the generality of the PSG framework. Note that the No-Context PSG performs poorly, demonstrating that the use of contextual information is crucial for contour detection.

Comparing the AUC achieved by the PSG contour model to that of the No-Context PSG, it is clear that the use of contextual information is of critical importance for high-quality contour detection. Note that the PSG contour model achieves competitive results compared to the FOP models described in [16], despite the PSG framework’s general-purpose nature. We believe it is possible to define more realistic models of contours in the PSG framework to improve performance. For example, one could make use of higher-order contour statistics such as curvature information, and operate in a multi-scale fashion similar to the 4-level FOP model; both of these concepts can in principle be described in the language of a PSG. However, the goal of this thesis is to demonstrate the generality of the PSG framework and we leave the design and structure learning of more sophisticated contour models for future research.

Figure 9.3: Precision-recall curves for the PSG contour model and baseline models. The FOP results were obtained from the authors of [16]. The AUC for each method is shown in the legend. The PSG contour model is competitive with the 1-level FOP model from [16], but is outperformed by the 4-level FOP model. The overall poor performance of the No-Context PSG demonstrates the importance of using a notion of context for contour detection.

9.2 Face Localization

The PSG framework can be applied to the problem of object localization. Here, we demonstrate the application of PSGs to the problem of face localization. The goal is to localize one or more faces in a set of images, as well as each face’s left eye, right eye, nose, and mouth. We study face localization on two datasets; one dataset has a single face per image, and the other has multiple faces per image. We describe the datasets below.

9.2.1 Dataset: Labelled Faces in the Wild

To study face localization when there is exactly one face in the image, we use the Labelled Faces in the Wild (LFW) dataset introduced in [27]. The dataset contains faces in unconstrained environments. We randomly select 200 images for training, and 100 images for testing. Although the dataset comes annotated with the identity of the person in the image, it does not come with part annotations. We manually annotate all training and test images with bounding box information for the face, left eye, right eye, nose, and mouth. Examples of bounding box annotations are shown in Figure 9.4.

9.2.2 Dataset: Family Portraits

The LFW dataset is constrained to have only one face per image and is not suitable for evaluating localization performance when there are multiple faces in an image. To study multiple face detection, we collect a dataset of 40 images of family and class portraits taken from the Internet. We used the search strings “family portraits”, “class portraits”, and “school portraits” on Google™ in November 2016. We manually annotated each image with bounding box information for the face, left eye, right eye, nose, and mouth.
Examples of bounding box annotations are shown in Figure 9.5. On average, there are 5.9 faces per image. We refer to this dataset as “Portraits”.

9.2.3 Face Detection Grammar

The PSG we use for face detection experiments is similar to the grammar described in Grammar 2, but with several differences:

• The EYE symbol in the grammar is replaced by LEFT-EYE and RIGHT-EYE symbols. Thus, the grammar distinguishes between left eyes and right eyes.

• Scale information is included in the pose space. This enables the grammar to express relationships such as “a small face has a small mouth that is only a few pixels below the centre of the face” and “a large face has a large mouth that can be many pixels below the centre of the face”. The pose space is defined so that objects detected at smaller scales can be localized with higher precision than objects detected at larger scales.

• The grammar does not use Uniform conditional pose distributions to express the geometric relationship between a face and its constituent parts. Instead, the conditional pose distributions are Categorical distributions whose parameters are learned in a supervised learning approach described later in Section 9.2.5.

• The grammar contains symbols that represent the concept of “look-alikes”. “Look-alike” symbols provide a mechanism for the PSG to handle false positives that arise due to weaknesses in the given data model. For example, a MOUTH “look-alike” brick represents an entity that merely looks like a mouth under the data model, but may not truly be a mouth. Given an image patch that looks like a mouth, there are two possibilities: 1) the image patch is truly a mouth with other face parts nearby, or 2) the image patch only looks like a mouth, with no other face parts nearby. The “look-alike” symbols explicitly model these possibilities in the grammar. We include corresponding look-alike symbols for the FACE, LEFT-EYE, RIGHT-EYE, NOSE, and MOUTH symbols. Although not an integral part of the model, we have found that in practice “look-alike” symbols improve performance by reducing false detections. The problem of false detections caused by weaknesses in the data model, especially those based on gradient information, is discussed in [62]. We will denote “look-alike” symbols with the prefix T-.

Figure 9.4: Examples of manually annotated images from the LFW dataset. Images are annotated with bounding boxes for the face (red), left eye (green), right eye (blue), nose (cyan), and mouth (magenta). Note that we distinguish between left and right eyes. All LFW images are 250 × 250 pixels.

Figure 9.5: Examples of manually annotated images from the Portraits dataset. Images for both datasets are annotated with bounding boxes for the face (red), left eye (green), right eye (blue), nose (cyan), and mouth (magenta). Note that we distinguish between left and right eyes. The sizes of the images in the Portraits dataset are variable.

We will refer to our PSG for face localization as the “PSG Face Grammar” model. The specification of the PSG Face Grammar is given in Grammar 7. We use L to denote the number of scales considered for all symbols in the grammar, and [N_s] × [M_s] denotes a grid of points at a scale 1 ≤ s ≤ L.

Grammar 7 The PSG Face Grammar:
Σ = {FACE, LEFT-EYE, RIGHT-EYE, NOSE, MOUTH, T-FACE, T-LEFT-EYE, T-RIGHT-EYE, T-NOSE, T-MOUTH}.

\[
\forall A \in \Sigma, \quad \Omega_A = \bigcup_{s=1}^{L} \big( [N_s] \times [M_s] \big).
\]
Rules:
1.0, (FACE, ω) → (T-FACE, Categorical(· | θ_(ω,1,1))), (LEFT-EYE, Categorical(· | θ_(ω,1,2))), (RIGHT-EYE, Categorical(· | θ_(ω,1,3))), (NOSE, Categorical(· | θ_(ω,1,4))), (MOUTH, Categorical(· | θ_(ω,1,5)))
1.0, (LEFT-EYE, ω) → (T-LEFT-EYE, δ(ω))
1.0, (RIGHT-EYE, ω) → (T-RIGHT-EYE, δ(ω))
1.0, (NOSE, ω) → (T-NOSE, δ(ω))
1.0, (MOUTH, ω) → (T-MOUTH, δ(ω))
1.0, (T-LEFT-EYE, ω) → ∅
1.0, (T-RIGHT-EYE, ω) → ∅
1.0, (T-NOSE, ω) → ∅
1.0, (T-MOUTH, ω) → ∅
ε_FACE = 10⁻⁴,
ε_LEFT-EYE = ε_RIGHT-EYE = ε_NOSE = ε_MOUTH = 10⁻¹²,
ε_T-FACE = ε_T-LEFT-EYE = ε_T-RIGHT-EYE = ε_T-NOSE = ε_T-MOUTH = 10⁻⁴.

9.2.4 Face data model

We incorporate image evidence for a brick (A, ω) to be present/absent in a scene by attaching a unary potential to the variable node X(A, ω), A ∈ {T-FACE, T-LEFT-EYE, T-RIGHT-EYE, T-NOSE, T-MOUTH}, ω ∈ Ω_A, in the factor graph. In other words, only the “look-alike” symbols have an associated data model. Here, we describe the form of the factor f⁰_(A,ω) we use for face and face-part localization.

Given an image Y, we denote a unary potential attached to X(A, ω) by f⁰_(A,ω)(X(A, ω), Y). We define the unary potentials for a symbol A using a histogram-of-oriented-gradients (HOG) filter (see [9] for a description of HOG features). Let H_(A,ω)(Y) be the response of a HOG filter associated with symbol A and pose ω in an image Y. We define

\[
f^0_{(A,\omega)}(X(A,\omega), Y) = p_A(H_{(A,\omega)}(Y) \mid X(A,\omega)) \tag{9.2}
\]

where p_A is a conditional distribution over discretized HOG detection scores for symbol A. The procedure to fit these distributions is described below. Note that each brick associated with a unary potential is also associated with a HOG detection score, and so a HOG filter. Thus, each such brick has a bounding box associated with it corresponding to the spatial extent of the HOG filter.

To build the data model, we first train HOG filters using publicly-available code from [12] (https://cs.brown.edu/~pff/latent-release4/). We train separate filters for each “look-alike” symbol using annotated bounding boxes to define positive examples. The negative examples are taken from the PASCAL VOC 2012 dataset described in [11], with images containing the class “People” removed. We use 10 scales per octave for each object/part and do not use hard negative mining. Figure 9.6 shows a visualization of the HOG filters learned using 200 images from the LFW as positive examples. For all face detection experiments, the positive examples are taken from the LFW dataset.

For a symbol A, to construct p_A(· | X(A, ω) = 1) we first obtain a set of detection scores by finding in each image the highest HOG detection score whose associated spatial extent has an intersection-over-union measure of at least 0.7 with the ground truth bounding box. We then clamp the detection scores to be in the range [−2, 2], construct a 20-bin frequency histogram of detection scores, normalize the histogram to sum to 1, and finally smooth the distribution with a Gaussian kernel to obtain p_A(· | X(A, ω) = 1). To construct p_A(· | X(A, ω) = 0), we use a similar approach, but we use all the detection scores in each image as the set of HOG detection scores. Figure 9.7 shows the learned distributions p_A(· | X(A, ω) = 1) and p_A(· | X(A, ω) = 0).
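A sketch of the histogram-fitting step just described. The clamping range and bin count follow the text; the Gaussian smoothing width is not specified in the text, so the value below is an assumption.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def fit_score_distribution(scores, n_bins=20, lo=-2.0, hi=2.0, smooth=1.0):
        """Build p_A(. | X(A,w)) from a set of HOG detection scores:
        clamp to [lo, hi], build an n_bins-bin frequency histogram,
        smooth with a Gaussian kernel, and normalize. `smooth` is an
        assumed smoothing width."""
        clamped = np.clip(scores, lo, hi)
        hist, _ = np.histogram(clamped, bins=n_bins, range=(lo, hi))
        hist = gaussian_filter1d(hist.astype(float), sigma=smooth)  # smooth histogram
        return hist / hist.sum()                                    # normalize to sum to 1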
9.2.5 Fitting model parameters

For all face detection experiments, we fit the conditional pose distributions of the PSG Face Grammar using the LFW dataset. To fit the conditional pose distributions, we use ground truth bounding box information to provide supervision. For each face in the training set, we have its bounding box and the bounding boxes of its constituent parts. We convert each ground truth bounding box for the face, left eye, right eye, nose, and mouth into a pose for the corresponding symbol in the grammar. To do this, first recall that each brick has an associated bounding box. We select the pose associated with the highest detection score with an intersection-over-union measure of at least 0.7 with the ground truth bounding box. This process converts each annotated bounding box into a pose in the pose space of the corresponding symbol. Using this information, we can fit the conditional pose distributions using maximum likelihood estimation.

Figure 9.6: Visualization of the HOG filters learned using 200 examples from the LFW as positive examples. Note that the HOG filters for the T-LEFT-EYE and T-RIGHT-EYE symbols are subtly different, indicating there is a visual difference between the two parts. Also note that the T-MOUTH filter shares some similarities with both the T-LEFT-EYE and T-RIGHT-EYE filters, indicating that HOG filters may not be an ideal feature representation for distinguishing between mouths and eyes.

Figure 9.7: Distributions over HOG detection scores representing the data model.

Note that since each symbol occurs only once on the left-hand side of the set of rules, there is no need to learn the parameters q. We keep the self-rooting probabilities fixed to those given in the PSG Face Grammar model (Grammar 7). Note that the parts of the face, {LEFT-EYE, RIGHT-EYE, NOSE, MOUTH}, have low self-rooting probability (10⁻¹²), indicating that the model places low probability on the event that these symbols appear on their own. In contrast, the corresponding “look-alike” symbols have a much higher self-rooting probability (10⁻⁴). As a result, an image region that looks like a face part but appears on its own is more likely to be explained as a self-rooting “look-alike” symbol rather than as a true face part.

9.2.6 Face localization results on single-face images: LFW

The data model and conditional pose distributions were fit using 200 annotated training examples from the LFW dataset. We use a separate 100 examples for testing. The output of LBP with the PSG Face Grammar gives, for each brick (A, ω) in the scene, an approximate marginal probability that the brick is present: p̂(X(A, ω) = 1 | Y). Since there is only a single face in each image in the LFW dataset, to perform face localization in an image, for all A ∈ {FACE, RIGHT-EYE, LEFT-EYE, NOSE, MOUTH} we output

\[
\omega^* = \arg\max_{\omega \in \Omega_A} \hat p(X(A,\omega) = 1 \mid Y) \tag{9.3}
\]

as the predicted pose for symbol A in the scene.

As baseline models, we use our own implementation of Pictorial Structures and a model that uses only the individual HOG filter scores to perform localization of each part independently. We refer to the latter approach as the “HOG Filters” approach. To perform inference with Pictorial Structures, we use the MRF representation of a Pictorial Structures model described in Chapter 7. The symbols of the Pictorial Structures model are FACE, LEFT-EYE, RIGHT-EYE, NOSE, and MOUTH. The pose spaces for the symbols are the same as in the PSG Face Grammar. Since the MRF is acyclic, the marginal probabilities can be computed exactly using dynamic programming and we use Eqn. 9.3 to output a predicted pose for each symbol. To perform inference using only HOG filter scores, the predicted pose for each symbol A ∈ {FACE, LEFT-EYE, RIGHT-EYE, NOSE, MOUTH} given an image Y is

\[
\omega^* = \arg\max_{\omega \in \Omega_A} \frac{p_A(H_{(A,\omega)}(Y) \mid X(A,\omega) = 1)}{p_A(H_{(A,\omega)}(Y) \mid X(A,\omega) = 0)}. \tag{9.4}
\]
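Both prediction rules reduce to an argmax over the pose space. A minimal sketch, assuming the relevant quantities have been flattened into arrays:

    import numpy as np

    def predict_pose(p_present):
        """Eqn. 9.3: report the pose with the highest approximate marginal
        probability. p_present[w] approximates p̂(X(A, w) = 1 | Y) over a
        flattened pose space."""
        return int(np.argmax(p_present))

    def predict_pose_hog(score_lik_pos, score_lik_neg):
        """Eqn. 9.4 (HOG Filters baseline): report the pose maximizing the
        likelihood ratio p_A(H | X = 1) / p_A(H | X = 0)."""
        return int(np.argmax(score_lik_pos / score_lik_neg))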
Inference for the PSG Face Grammar model using LBP on a 250 × 250 test image took 120 seconds, and inference for both the Pictorial Structures and HOG Filters baseline models took around 5 seconds.

We show qualitative localization results in Figure 9.8. As shown in Figure 9.8, the HOG Filters model performs poorly and often confuses mouths with left eyes and right eyes. This occurs in three of the four examples shown and can be attributed to the similarity of the learned HOG filters for mouths and eyes, as shown in Figure 9.6. In contrast, the PSG Face Grammar does not confuse mouths and eyes since it uses geometric information encoded in the conditional pose distributions to localize the parts of the face. The use of geometric structure and “look-alike” symbols helps the PSG Face Grammar compensate for the ambiguous data model and robustly perform face localization. Pictorial Structures performs similarly to the PSG Face Grammar on this dataset. This is to be expected, since one major difference between Pictorial Structures and the PSG Face Grammar is that Pictorial Structures assumes there is only one object of each type per image, which is an accurate assumption for the LFW dataset. Comparing the results of the HOG Filters approach to the results of the PSG Face Grammar and Pictorial Structures models demonstrates that contextual information is crucial for accurate face and face-part localization.

Figure 9.8: Localization results on four examples from the LFW dataset. Left: annotated ground-truth bounding boxes. Middle-Left: results of the HOG Filters model. Middle-Right: results of the PSG Face Grammar model. Right: results of Pictorial Structures. The parts are FACE (red), LEFT-EYE (green), RIGHT-EYE (blue), NOSE (cyan), and MOUTH (magenta). For each symbol, we show the bounding box corresponding to the pose with the highest computed marginal probability. Note that both the PSG Face Grammar model and Pictorial Structures perform well while the HOG Filters model performs poorly in some cases, suggesting that using geometric information is crucial for accurate localization.

Table 9.2 provides a quantitative evaluation of the PSG Face Grammar model and the baseline models in terms of mean distance from the centre of the ground truth bounding box.

Model                  FACE   LEFT-EYE   RIGHT-EYE   NOSE   MOUTH   Average
HOG Filters            3.7    4.7        8.2         3.3    13.6    6.7
Pictorial Structures   3.3    2.6        3.1         2.4    3.4     3.0
PSG Face Grammar       3.5    2.6        3.3         2.4    3.5     3.1

Table 9.2: Mean distance of top detections to the centre of the ground truth bounding box, in pixels, on the LFW dataset. A key difference between Pictorial Structures and the PSG Face Grammar is that Pictorial Structures assumes one object per image, while the PSG Face Grammar makes no such assumption. However, in the LFW dataset there is indeed only one face per image, and so the two models perform similarly on this dataset.

Table 9.3 provides an evaluation in terms of area under the precision-recall curves. For this evaluation, we perform non-maximum suppression for the symbols FACE, LEFT-EYE, RIGHT-EYE, NOSE, and MOUTH separately. We first sort the detection probabilities for each symbol, then perform suppression so that no two detections of the same symbol overlap. We consider a detection to be a true positive if it is the highest scoring detection with an intersection-over-union ratio of at least 0.5 with the ground truth bounding box. We consider a detection a false positive if it is not the highest scoring detection with an intersection-over-union ratio of at least 0.5 with the ground truth bounding box, thus penalizing multiple detections of the same object.
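A sketch of the per-symbol non-maximum suppression described above; the greedy box-based formulation and the IoU helper are illustrative assumptions about how the suppression can be realized.

    def non_max_suppression(detections, iou_threshold=0.0):
        """Sort detections by probability, then greedily keep a detection only
        if its box does not overlap any already-kept box. `detections` is a list
        of (prob, box) with box = (x1, y1, x2, y2); iou_threshold = 0 enforces
        that no two kept detections of the same symbol overlap."""
        def iou(a, b):
            ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
            ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
            area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
            return inter / float(area(a) + area(b) - inter)
        kept = []
        for prob, box in sorted(detections, key=lambda d: d[0], reverse=True):
            if all(iou(box, k[1]) <= iou_threshold for k in kept):
                kept.append((prob, box))
        return kept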
Once again, the HOG Filters model performs poorly while the PSG Face Grammar model and Pictorial Structures perform well, demonstrating the importance of geometric information.

Model                  FACE   LEFT-EYE   RIGHT-EYE   NOSE   MOUTH   Average
HOG Filters            1.00   0.76       0.65        0.96   0.60    0.80
Pictorial Structures   1.00   0.97       0.93        0.98   0.90    0.96
PSG Face Grammar       1.00   0.98       0.92        0.98   0.92    0.96

Table 9.3: Area under the precision-recall curve on the LFW dataset. Note that the HOG Filters model performs significantly worse than the PSG Face Grammar model and Pictorial Structures, demonstrating the importance of contextual information for accurate object localization. The PSG Face Grammar model and Pictorial Structures perform similarly, as is expected since there is only one face per image in this dataset.

9.2.7 Face localization results on multiple-face images: Portraits

A key difference between the general PSG framework and the Pictorial Structures model of [13] is that the PSG framework makes no assumptions concerning the number of symbols of each type in an image, while Pictorial Structures assumes there is exactly one symbol of each type in an image. As such, while the performance of both approaches may be similar when localizing faces in scenes with a single face, performance may be quite different in scenes with a variable number of faces. To study the ability of the PSG Face Grammar and baseline methods to detect multiple faces in an image, we perform face localization on the Portraits dataset described in Section 9.2.2. We use the same PSG Face Grammar, model parameters, and data model as in the LFW face detection experiments.

Tables 9.4 and 9.5 show a qualitative localization comparison between the PSG Face Grammar and the baseline methods on the Portraits dataset. We show the top K detections for each symbol after performing non-maximum suppression, where K is the ground truth number of faces in the image. Non-maximum suppression is performed in the same fashion as described in Section 9.2.6.

Table 9.6 compares the area under the precision-recall curves for the PSG Face Grammar and baseline methods. For this evaluation, we use the same non-maximum suppression approach and criterion for true/false positives as for the LFW dataset. In particular, we perform non-maximum suppression for the symbols FACE, LEFT-EYE, RIGHT-EYE, NOSE, and MOUTH separately. We first sort the detection probabilities for each symbol, then perform suppression so that no two detections of the same symbol overlap. We consider a detection to be a true positive if it is the highest scoring detection with an intersection-over-union ratio of at least 0.5 with the ground truth bounding box. We consider a detection a false positive if it is not the highest scoring detection with an intersection-over-union ratio of at least 0.5 with the ground truth bounding box, thus penalizing multiple detections of the same object.

Table 9.4: Top K localization results on two examples from the Portraits dataset after non-maximum suppression. K is set to the ground-truth number of faces in the image for visualization purposes. Top row: annotated ground-truth bounding boxes. Middle-top row: results of the HOG Filters model. Middle-bottom row: results of Pictorial Structures. Bottom row: results of the PSG Face Grammar model. The parts are FACE (red), LEFT-EYE (green), RIGHT-EYE (blue), NOSE (cyan), and MOUTH (magenta). Note that in both examples, Pictorial Structures makes a mistake in localizing the mouth of one of the subjects, while the PSG Face Grammar model does not. This is because of the PSG Face Grammar model’s use of “look-alike” symbols, a concept that cannot be captured in Pictorial Structures. The HOG Filters model performs poorly, demonstrating the importance of using contextual information in object localization.
Table 9.5: Top K localization results on two examples from the Portraits dataset after non-maximum suppression. K is set to the ground-truth number of faces in the image for visualization purposes. Top row: annotated ground-truth bounding boxes. Middle-top row: results of the HOG Filters model. Middle-bottom row: results of Pictorial Structures. Bottom row: results of the PSG Face Grammar model. The parts are FACE (red), LEFT-EYE (green), RIGHT-EYE (blue), NOSE (cyan), and MOUTH (magenta). The example on the right shows a failure mode for all models. The lighting creates a challenging environment due to shadows, the left-most subject’s head is significantly rotated, and the pattern on the couch resembles a face. A richer PSG model that includes in-plane rotation as part of the pose space may be able to address the failure modes in the example on the right.

Model                  FACE   LEFT-EYE   RIGHT-EYE   NOSE   MOUTH   Average
HOG Filters            0.95   0.50       0.48        0.90   0.32    0.63
Pictorial Structures   0.97   0.78       0.69        0.96   0.73    0.82
PSG Face Grammar       0.97   0.81       0.81        0.96   0.80    0.87

Table 9.6: Area under the precision-recall curves for the Portraits dataset. Note that the HOG Filters model performs much worse than the PSG Face Grammar, as was the case in the LFW dataset results. Here, however, the PSG Face Grammar outperforms the Pictorial Structures model. A key difference between the two models is that the Pictorial Structures model assumes that there is one face per image, while the PSG Face Grammar does not. Since the Portraits dataset contains a variable number of faces per image, the one-face assumption made by Pictorial Structures is violated, leading to degraded performance.

Unlike the results on the LFW dataset, on the Portraits dataset the PSG Face Grammar model significantly outperforms the Pictorial Structures model. The key difference between the two models is that the Pictorial Structures model assumes that there is one face per image, while the PSG Face Grammar does not make that assumption. Since the Portraits dataset contains a variable number of faces per image, the one-face assumption made by Pictorial Structures is violated. This makes Pictorial Structures unable to select a consistent detection threshold for reporting positive detections, since such a threshold depends on the number of faces in the scene. To demonstrate this point, consider the case where there are K identical faces in a scene. Since Pictorial Structures assumes that there is only one face in the scene, each of the K faces receives 1/K of the probability mass. If K can vary between images, as is the case here, it is not possible to set a consistently tight detection threshold.
The PSG Face Grammar model does not suffer from this thresholding issue since it makes no assumptions concerning the number of faces in the scene.

9.2.8 Face localization without a Face data model

We argue that contextual information plays a key role in object localization. To study the role of contextual information in the task of face localization, we repeat the experiments on the LFW and Portraits datasets using the PSG Face Grammar, but without a data model for the T-FACE symbol. In other words, there are no unary potentials attached to the random variables X(T-FACE, ω), ω ∈ Ω_T-FACE. In this setting, the notion of a face is defined solely in terms of its relationship to its parts; the idea of a face is a purely abstract concept. In this section, we explore the ability of this “Faceless Grammar” to perform face localization despite not having a data model for faces.

We compare the area under the precision-recall curves on the Portraits dataset in Table 9.7. Although the Faceless Grammar does worse on this measure, it still performs reasonably well at face localization, achieving an area under the precision-recall curve of 0.93 for the FACE symbol. This demonstrates that it is possible to perform face localization without an explicit data model for faces.

Model              FACE   LEFT-EYE   RIGHT-EYE   NOSE   MOUTH   Average
PSG Face Grammar   0.97   0.81       0.81        0.96   0.80    0.87
Faceless Grammar   0.93   0.78       0.80        0.95   0.76    0.84

Table 9.7: Area under the precision-recall curves on the Portraits dataset. Note that the Faceless Grammar performs worse than the PSG Face Grammar. However, the Faceless Grammar performs reasonably well considering it is attempting to localize an object for which it has no image evidence.

9.3 Binary image segmentation

To study binary image segmentation, we use the Swedish Leaf Dataset described in [53]. We use only the Rowan leaves class in our experiments because of their varied and complex shapes. The Rowan leaf class contains 75 examples. Following the experimental setup described in [16], we use 50 examples for training and the rest for testing. Each example contains exactly one Rowan leaf and is encoded as a binary map B. From a binary map B we generate a noisy real-valued image D by sampling each pixel D(i, j) independently from a Normal distribution whose mean depends on the value of B(i, j). Formally,

\[
D(i,j) \sim \mathcal{N}(\mu_{B(i,j)}, \sigma). \tag{9.5}
\]

For the experiments, we used μ₀ = 150, μ₁ = 100, and σ = 100. Note that the data model in Eqn. 9.5 is the same as for contour detection, but in these experiments we use a higher value of σ. Figure 9.9 shows examples of binary maps and the noisy real-valued images.

Figure 9.9: Examples of the data used for the binary image segmentation experiments. Top row: examples of B. Foreground pixels are shown in black. Bottom row: corresponding examples of D.

9.3.1 The PSG binary image segmentation models

In the experiments below, we study two PSGs for binary image segmentation. Both PSGs are cyclic, and we constrain the parameters θ so that θ_(ω,r,i,z) = θ_(ω',r,i,z') whenever ω − z = ω' − z'; i.e., the conditional pose distributions are constrained to be shift-invariant. We also constrain the PSGs so that no brick may generate itself in one production rule. The first PSG we study for binary image segmentation is similar to the one described in Grammar 3, but with different model parameters.
The model parameters are learned by using the ground-truth contour maps B as observations for the symbol FG and then using the approximate EM algorithm described in Chapter 8. Note that this is a partially-supervised setting, since the supervision labels specify only the presence/absence of a subset of bricks. The learned grammar is shown in Grammar 8. Note that this grammar is cyclic, and so the factor graph construction described in Definition 16 leads to a different (but related) distribution over scenes. We will refer to this grammar as the “Simple Segmentation Grammar”. For readability, we denote θ_(ω,1,1) by θ_SEED since |Ω_SEED| = 1, and θ_(ω,2,1) by θ_(ω,FG) for all ω ∈ Ω_FG.

Grammar 8 The “Simple Segmentation Grammar” for 2-D binary image segmentation in an N × M scene, with model parameters learned in a partially-supervised setting:
Σ = {SEED, FG}. Ω_SEED = [1]. Ω_FG = [N] × [M].
Rules:
1.0, (SEED, 1) → (FG, Categorical(· | θ_SEED))
1.0, (FG, ω) → (FG, IndBern(· | θ_(ω,FG)))
ε_SEED = 1, ε_FG = 0.

Recall that for inference, we convert a PSG to a factor graph and run LBP. To incorporate the data model given in Eqn. 9.5 into the factor graph representation, we attach unary potentials to the set of variable nodes {X(FG, (i, j)) | (i, j) ∈ Ω_FG}. In particular, for a variable node X(FG, (i, j)) we attach a unary potential f⁰_(FG,(i,j))(X(FG, (i, j)) = x, D) = N(D(i, j); μ_x, σ).

One weakness of the Simple Segmentation Grammar is that it is incapable of modeling structured variations in local shape. Compare the shape of the foreground in a 3 × 3 region around a pixel on the stem of the leaf with that in a 3 × 3 region around a pixel on a lobe (a component jutting off the stem). Locally, the shape of the foreground is very different around these two areas. Around a pixel located on the stem of the leaf, the foreground tends to extend above and below the pixel. Around a pixel located in the middle of a lobe, the foreground tends to extend in all directions. To model these structured variations in the local shape of the foreground, we use a PSG that has the capacity to model different local foreground shapes.

Grammar 9 describes a binary image segmentation model with the capacity to model different local foreground shapes. (For readability, we denote θ_(ω,1,1) by θ_SEED since |Ω_SEED| = 1. We also denote θ_(ω,2,1) by θ_(ω,S1), θ_(ω,7,1) by θ_(ω,S2), θ_(ω,12,1) by θ_(ω,S3), θ_(ω,17,1) by θ_(ω,S4), and θ_(ω,22,1) by θ_(ω,S5).) Note that this grammar is cyclic, and so the factor graph construction described in Definition 16 leads to a different (but related) distribution over scenes. Each symbol S_j, 1 ≤ j ≤ 5, models a different local shape. Each brick (S_j, ω), ω ∈ Ω_Sj, can generate a brick (FG, ω) and other S_j bricks in an 8-neighbourhood around it, or it can generate a brick (S_k, ω), k ≠ j, to model a change of local shape. The single SEED brick chooses an S_1 brick to start the generative process. The model parameters are learned in the same fashion as for Grammar 8. Note that a priori the set of symbols {S_j | 2 ≤ j ≤ 5} has no semantic meaning, and these symbols are exchangeable in the model. To break symmetries in the model, we randomly initialize the model parameters relating to the set of symbols {S_j | 2 ≤ j ≤ 5}. We will refer to this model as the “5-component Segmentation Grammar”.

For both binary segmentation models described above, to incorporate the data model given in Eqn. 9.5 into the factor graph representation, we attach unary potentials to the set of variable nodes {X(FG, (i, j)) | (i, j) ∈ Ω_FG}. For a variable node X(FG, (i, j)) we attach a unary potential f⁰_(FG,(i,j))(X(FG, (i, j)) = x, D) = N(D(i, j); μ_x, σ). In this case, the resulting factor graph represents the conditional distribution p(S | D).
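For intuition, the following toy sketch draws a single forward sample from the Simple Segmentation Grammar’s generative process: SEED picks a starting FG brick from Categorical(θ_SEED), and each FG brick independently generates 8-neighbours via shift-invariant Bernoulli probabilities. This is a simplified illustration of one rollout, not the inference procedure used in the experiments; the step cap and data layout are our own assumptions.

    import numpy as np

    def sample_simple_segmentation(theta_seed, theta_fg, shape, steps=10000, rng=None):
        """theta_seed: array over the scene grid, summing to 1 (the Categorical
        over the starting FG brick). theta_fg: dict mapping an 8-neighbour
        offset (di, dj) -> Bernoulli probability (shift-invariant)."""
        rng = np.random.default_rng() if rng is None else rng
        fg = np.zeros(shape, dtype=bool)
        start = np.unravel_index(rng.choice(theta_seed.size, p=theta_seed.ravel()), shape)
        fg[start] = True
        frontier = [start]
        while frontier and steps > 0:
            i, j = frontier.pop()
            for (di, dj), p in theta_fg.items():   # expand to 8-neighbours
                ni, nj = i + di, j + dj
                if 0 <= ni < shape[0] and 0 <= nj < shape[1] and not fg[ni, nj]:
                    if rng.random() < p:
                        fg[ni, nj] = True
                        frontier.append((ni, nj))
            steps -= 1
        return fg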
Grammar 9 The “5-component Segmentation Grammar” for 2-D binary image segmentation in an N × M scene, with model parameters learned in a partially-supervised setting:
Σ = {SEED, FG, S1, S2, S3, S4, S5}. Ω_SEED = [1]. Ω_FG = [N] × [M]. Ω_Sj = [N] × [M], 1 ≤ j ≤ 5.
Rules:
1.000, (SEED, 1) → (S1, Categorical(· | θ_SEED))
0.841, (S1, ω) → (S1, IndBern(· | θ_(ω,S1))), (FG, δ(ω))
0.042, (S1, ω) → (S2, δ(ω))
0.040, (S1, ω) → (S3, δ(ω))
0.037, (S1, ω) → (S4, δ(ω))
0.040, (S1, ω) → (S5, δ(ω))
0.824, (S2, ω) → (S2, IndBern(· | θ_(ω,S2))), (FG, δ(ω))
0.043, (S2, ω) → (S1, δ(ω))
0.045, (S2, ω) → (S3, δ(ω))
0.042, (S2, ω) → (S4, δ(ω))
0.045, (S2, ω) → (S5, δ(ω))
0.842, (S3, ω) → (S3, IndBern(· | θ_(ω,S3))), (FG, δ(ω))
0.038, (S3, ω) → (S1, δ(ω))
0.043, (S3, ω) → (S2, δ(ω))
0.037, (S3, ω) → (S4, δ(ω))
0.040, (S3, ω) → (S5, δ(ω))
0.858, (S4, ω) → (S4, IndBern(· | θ_(ω,S4))), (FG, δ(ω))
0.033, (S4, ω) → (S1, δ(ω))
0.038, (S4, ω) → (S2, δ(ω))
0.036, (S4, ω) → (S3, δ(ω))
0.035, (S4, ω) → (S5, δ(ω))
0.844, (S5, ω) → (S5, IndBern(· | θ_(ω,S5))), (FG, δ(ω))
0.037, (S5, ω) → (S1, δ(ω))
0.042, (S5, ω) → (S2, δ(ω))
0.040, (S5, ω) → (S3, δ(ω))
0.037, (S5, ω) → (S4, δ(ω))
ε_SEED = 1, ε_FG = 0, ε_Sj = 0 for 1 ≤ j ≤ 5.

To gain insight into the learned PSGs, we study the model parameters θ learned by both models. First, consider the parameters θ_SEED. For both models, this parameter encodes the distribution over the FG brick that will start the generative process of creating the foreground of the leaf. The models must balance two considerations in setting the parameters θ_SEED. Let B̄ represent the mean of the binary maps across training examples. Intuitively, one might expect θ_SEED to resemble B̄, since B̄(i, j) is the probability that the brick (FG, (i, j)) will be present in a scene in the training set. On the other hand, the model might prefer to start the generative process at a more central FG brick in the scene. Consider the probability that the generative process will cause brick (FG, (i, j)) to be present in the scene when the generative process starts at brick (FG, (i₀, j₀)). This probability decreases as the distance between (i, j) and (i₀, j₀) increases, and so the model may favour more central locations to start the generative process. Figure 9.10 shows a visualization of B̄ and the parameters θ_SEED learned for the Simple Segmentation Grammar and the 5-component Segmentation Grammar. Note that log(θ_SEED) for both models resembles B̄.

Figure 9.10: Visualization of B̄ and the parameters θ_SEED learned for the Simple Segmentation Grammar and the 5-component Segmentation Grammar. (a) Visualization of B̄; darker pixels correspond to a higher probability of that pixel being labeled foreground in the training images. (b) Visualization of log(θ_SEED) learned for the Simple Segmentation Grammar. (c) Visualization of log(θ_SEED) learned for the 5-component Segmentation Grammar. For panels (b) and (c), darker pixels correspond to a higher value of θ_SEED at that location. For visualization, the parameters θ_SEED are shown in the log domain and linearly scaled to be between 0 and 1.
In Figure 9.11, we show a visualization of the parameters θ_(ω,FG) learned for the Simple Segmentation Grammar and θ_(ω,Sj), 1 ≤ j ≤ 5, learned for the 5-component Segmentation Grammar. As shown in the figure, the parameter θ_(ω,FG) learned for the Simple Segmentation Grammar tends to favour expanding a FG brick in all directions, while the parameters θ_(ω,Sj), 1 ≤ j ≤ 5, encode variations in the shape of the local foreground.

Figure 9.11: Visualization of the learned parameters θ_(ω,FG) and θ_(ω,Sj), 1 ≤ j ≤ 5, for the Simple Segmentation Grammar and 5-component Segmentation Grammar, respectively. Darker pixels indicate a higher value of θ. Recall that we constrain the parameters θ_(ω,FG) and θ_(ω,Sj), 1 ≤ j ≤ 5, to be shift-invariant and such that a brick may not generate itself in one production. Given a brick with pose ω, the visualizations show the probability that the brick will generate each of its 8 neighbours, where the centre pixel corresponds to the brick with pose ω. The visualizations have been scaled consistently for comparison. Top row: visualization of the parameters θ_(ω,FG) learned for the Simple Segmentation Grammar. Bottom row, left to right: visualization of the parameters θ_(ω,Sj) for j = 1, . . . , 5 learned for the 5-component Segmentation Grammar.

9.3.2 Qualitative binary image segmentation results

In Figure 9.12 we show binary image segmentation results on examples from the Swedish Leaf Dataset described in [53]. We show the approximate marginal probability that each FG brick is present in the scene, p̂(X(FG, (i, j)) = 1 | D), as computed by LBP. On 256 × 256 test images, running LBP to convergence took less than 260 seconds for the Simple Segmentation Grammar and less than 1900 seconds for the 5-component Segmentation Grammar.

As shown in Figure 9.12, the Simple Segmentation Grammar creates “blob-like” foreground segmentations and does a poor job of differentiating the lobes of the leaves. In contrast, the 5-component Segmentation Grammar can more faithfully capture the shape of the lobes of the leaves, although it does so crudely. The ability to more finely capture the shape of the lobes can be attributed to the explicit modeling of local variations in the shape of the foreground. However, the 5-component Segmentation Grammar is more susceptible to picking up speckled noise, as shown in the figure.

Note that although both grammar models constrain scenes to have a single non-empty connected foreground component, the results of inference indicate that this constraint is being violated. As discussed in Section 6.3, the factor graph construction described in Chapter 4 does not match the distribution over scenes induced by a cyclic grammar. Since both the Simple Segmentation Grammar and the 5-component Segmentation Grammar are cyclic, the factor graphs on which LBP is run do not faithfully represent the PSGs they are derived from.

Figure 9.12: Binary image segmentation results on three examples from the Swedish Leaf test set. In the bottom two rows, each pixel represents an FG brick at that location. The gray-scale values show the approximate marginal probabilities p̂(X(FG, (i, j)) = 1 | D) computed by LBP. Darker pixels indicate a higher approximate marginal probability. First row: ground-truth contour maps B. Second row: noisy observations D. Third row: results of the Simple Segmentation Grammar. Fourth row: results of the 5-component Segmentation Grammar.
Moreover, since LBP is an approximate inference scheme, there is no guarantee that LBP will produce marginals consistent with the single non-empty connected foreground constraint, even if the factor graphs were faithful representations of the PSGs they are derived from. These two issues result in the approximate marginals produced by LBP being inconsistent with the constraint that the foreground be a single non-empty connected component. Nevertheless, as Figure 9.12 shows, the PSG framework still produces reasonable binary image segmentations despite these flaws.

9.3.3 Quantitative binary image segmentation results

In this subsection, we perform a quantitative comparison between the Simple Segmentation Grammar, the 5-component Segmentation Grammar, and baseline methods. Our chief comparison is to the FOP models of [16]. To demonstrate the importance of context for binary image segmentation, we also compare against a PSG where Σ = {FG}, the FG bricks are allowed to self-root, and the PSG’s only rule is FG → ∅; i.e., all bricks are independent. We will refer to this model as the “No-Context PSG”. As in the task of contour detection, for the PSG models we compute the area under the precision-recall curve (AUC) by thresholding p̂(X(FG, (i, j)) = 1 | D), (i, j) ∈ Ω_FG, over a range of values. For the work of [16], the authors have shared their experimental results and we compute AUC in a similar fashion as for the PSG models.

Table 9.8 compares the AUC of the Simple Segmentation Grammar, the 5-component Segmentation Grammar, and the No-Context PSG, as well as the 1-level and 4-level FOP models of [16]. Figure 9.13 compares the precision-recall curves of these methods. Note that the No-Context PSG performs the worst out of all methods tested, demonstrating that some notion of context is crucial for producing high-quality binary image segmentations on this dataset. Also, the 5-component Segmentation Grammar significantly outperforms the Simple Segmentation Grammar, indicating the importance of modeling the variation in local foreground shape. Although the PSG segmentation models are outperformed by both FOP models, both the Simple Segmentation Grammar and the 5-component Segmentation Grammar give reasonable results despite the general-purpose nature of the PSG framework. Also, the gap in performance between the best performing PSG model and the 1-level FOP model is relatively small.

We believe it is possible to define more effective models for binary image segmentation. As illustrated in Section 6.3, the use of cyclic grammars can sometimes be problematic in the PSG framework. A different model for binary image segmentation could perhaps be specified as an acyclic PSG, which may prove more effective than cyclic PSGs. For example, one could design a hierarchical acyclic PSG that can express long-range dependencies between FG bricks. The design of more sophisticated image segmentation models in the PSG framework is a future research goal.

Model                              AUC
No-Context PSG                     0.310
Simple Segmentation Grammar        0.911
5-component Segmentation Grammar   0.956
1-level FOP [16]                   0.967
4-level FOP [16]                   0.976

Table 9.8: Comparison of AUC for several different models. See text for discussion.

Figure 9.13: Precision-recall curves for several PSGs and the 1-level and 4-level FOP models. The AUC for each model is shown in the legend. Note the poor performance of the No-Context PSG, demonstrating the importance of contextual information in binary image segmentation.
Also note that the 5-component Segmentation Grammar significantly outperforms the Simple Segmentation Grammar, illustrating the importance of modeling the variation in the local shape of the foreground. Lastly, the best performing PSG model, the 5-component Segmentation Grammar, is competitive with the 1-level FOP model.

Chapter 10
Grammar Transformations

As discussed in Chapter 1, we seek efficient approximate inference algorithms, since in practice a scene understanding framework may be deployed in time-sensitive scenarios. Recall that the run time of LBP in the factor graph representation of a PSG is linear in the number of edges of the factor graph. Unfortunately, the PSG factor graph may contain upwards of millions of edges for a moderately sized model, and so LBP inference may be slow. For example, in the contour detection experiments described in Section 9.1, the factor graph had roughly 50 million edges and running LBP to convergence on a modern machine took approximately 1.5 hours. If one wishes to apply the PSG framework to more complex models than the ones expressed in this thesis and on larger scenes, then it is clear that the practical issues of run time (and memory) must be dealt with. In this chapter, we discuss strategies for reducing the number of edges in the factor graph representation of a PSG.

The main idea presented in this chapter is that a Categorical conditional pose distribution in the PSG can be represented by a combination of distributions. In the PSG framework, as we will see shortly, the “cost” of representing a distribution is the support size of the distribution. So, we seek strategies and approximations that reduce the support size of the distributions involved. We consider two special cases.

First, we approximate a general N-D Categorical conditional pose distribution by a product of N one-dimensional Categorical conditional pose distributions. For example, suppose we have a distribution over two variables, p(X, Y). We seek to represent this distribution by a factorized distribution p(X)p(Y). Here, the total support size of p(X) and p(Y) can be significantly less than the support size of p(X, Y).

Second, consider a Uniform distribution over K elements. We approximate this distribution as a combination of Uniform distributions, each one over a set of fewer elements. For example, consider a Uniform distribution over the set {0, 1, . . . , 99}. One could sample from this distribution by first drawing X uniformly from the set {0, 10, . . . , 90} and Y uniformly from the set {0, . . . , 9}, then declaring Z = X + Y as the sample drawn. The distribution over Z is uniform on the set {0, . . . , 99}, but here Z is represented as a combination of two Uniform distributions over 10 elements each. We make these ideas more concrete in the rest of this chapter and show how they can be applied in the PSG framework.

10.1 Counting factor graph edges

In order to reduce the number of edges in the factor graph representation of a PSG, we first analyze how the number of edges depends on the parameters of the PSG. To simplify the analysis below, we assume that |Γ_(ω,r,i)| = |Γ_(ω',r,i)| for all ω, ω' ∈ Ω_A(r,i) (i.e., |Γ_(ω,r,i)| is constant with respect to ω). We will use the notation |Γ_(ω,r,i)| = S_(r,i) for all ω ∈ Ω_A(r,i). Recall the factor graph representation in Figure 4.2 for a single brick (A, ω), A ∈ Σ, ω ∈ Ω_A. We can read off the number of edges connected to (the degree of) each factor for a single brick.
10.1 Counting factor graph edges

In order to reduce the number of edges in the factor graph representation of a PSG, we first analyze how the number of edges depends on the parameters of the PSG. To simplify the analysis below, we assume that $|\Gamma_{(\omega,r,i)}| = |\Gamma_{(\omega',r,i)}|$ for all $\omega, \omega' \in \Omega_{A(r,i)}$ (i.e., $|\Gamma_{(\omega,r,i)}|$ is a constant with respect to $\omega$). We will use the notation $|\Gamma_{(\omega,r,i)}| = S_{(r,i)}$ for all $\omega \in \Omega_{A(r,i)}$.

Recall the factor graph representation in Figure 4.2 for a single brick (A, ω), A ∈ Σ, ω ∈ Ω_A. We can read off the number of edges connected to (the degree of) each factor for a single brick. Table 10.1 summarizes the results as a function of the parameters of the PSG.

Factor node type | Degree of factor
$f^1_{(A,\omega)}$ | $1 + |\mathrm{par}(X(A,\omega))|$
$f^2_{(A,\omega)}$ | $1 + |R_A|$
$f^3_{(A,\omega,\cdot,\cdot)}$ | $|R_A| + \sum_{r \in R_A} \sum_{i=1}^{n_r} S_{(r,i)}$

Table 10.1: Number of edges connected to each type of factor for a single brick (A, ω).

The total number of edges associated with each type of factor can be computed by summing over all bricks in the factor graph. Table 10.2 summarizes the results.

Factor node type | Number of edges connected to factors of this type
$f^1_{(\cdot,\cdot)}$ | $\sum_{A\in\Sigma}|\Omega_A| + \sum_{A\in\Sigma}\sum_{\omega\in\Omega_A}|\mathrm{par}(X(A,\omega))| = \sum_{A\in\Sigma}|\Omega_A| + \sum_{A\in\Sigma}|\Omega_A|\sum_{r\in R_A}\sum_{i=1}^{n_r}S_{(r,i)}$
$f^2_{(\cdot,\cdot)}$ | $\sum_{A\in\Sigma}|\Omega_A| + \sum_{A\in\Sigma}|\Omega_A||R_A|$
$f^3_{(\cdot,\cdot)}$ | $\sum_{A\in\Sigma}|\Omega_A||R_A| + \sum_{A\in\Sigma}|\Omega_A|\sum_{r\in R_A}\sum_{i=1}^{n_r}S_{(r,i)}$

Table 10.2: Number of edges connected to each type of factor over all bricks in the PSG factor graph.

From Table 10.2, we can express the total number of edges in the factor graph as

$$\text{number of edges in factor graph} = 2\sum_{A\in\Sigma}|\Omega_A|\Big(1 + |R_A| + \sum_{r\in R_A}\sum_{i=1}^{n_r}S_{(r,i)}\Big). \tag{10.1}$$

10.2 Reducing the number of factor graph edges

We first define the Uniform distribution.

Definition 42 Let W be a set of binary random variables indexed by a set Υ. Also, let T ⊆ Υ be a set. Recall that we define the set I(W) = {k | W_k = 1, k ∈ Υ}. We define the Uniform distribution as

$$\mathrm{Uniform}(W; T) = \begin{cases} \frac{1}{|T|}, & \text{if } \sum_{k\in\Upsilon} W_k = 1 \text{ and } I(W) \subseteq T, \\ 0, & \text{otherwise.} \end{cases} \tag{10.2}$$

For brevity, for the rest of this chapter we drop the argument W from the Uniform distribution and denote it as Uniform(T).

Examining Eqn. 10.1, to reduce the number of edges in the factor graph representation of a PSG, we can reduce the size of the pose spaces, the number of production rules, and the size of the support of the conditional pose distributions. For some PSG models, the size of the pose spaces can be reduced without greatly affecting modeling power. For example, suppose the pose space for a symbol of a grammar was all pixel locations in a scene. Rather than consider all pixel locations, one could consider a coarse grid of pixel locations, e.g., every other pixel. If the pose space of a symbol includes orientation, as in Grammar 1, one could reduce the number of orientations considered. Production rules often model compositions between objects. For example, a compositional rule may model that a FACE is comprised of a LEFT-EYE, RIGHT-EYE, NOSE, and a MOUTH, as in Grammar 2. So, it may not be possible to reduce the number of production rules without drastically changing the model.

Conditional pose distributions represent geometric relationships between objects. For example, such a distribution may encode the fact that the NOSE of a FACE is located somewhere in the middle of the FACE within a region of uncertainty. As can be seen from Eqn. 10.1, the number of edges in the factor graph grows linearly with the total size of the support of the conditional pose distributions. In the next two sections, we focus on strategies for representing and approximating conditional pose distributions as combinations of distributions. We consider two special cases. In Section 10.3, we consider approximating a general N-D Categorical distribution by a product of N one-dimensional distributions. In Section 10.4, we consider representing a Uniform distribution as a combination of Categorical distributions. These techniques will allow us to reduce the number of edges in the factor graph by reducing the term $\sum_{r\in R_A}\sum_{i=1}^{n_r}S_{(r,i)}$ in Eqn. 10.1.
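As a sanity check on Eqn. 10.1, the sketch below counts factor graph edges using a hypothetical, deliberately simple encoding of a grammar as plain dictionaries (this is illustrative only, not the thesis implementation). Applied to Grammar 10 of Section 10.5.2, it reproduces the 1258NM figure reported later in Table 10.3:

```python
# Eqn. 10.1: total edges = 2 * sum_A |Omega_A| * (1 + |R_A| + sum of S_(r,i)).
def count_edges(pose_size, rules):
    """pose_size: dict mapping symbol A -> |Omega_A|.
    rules: dict mapping symbol A -> list of rules, where each rule is the
    list of support sizes S_(r,i) of its RHS conditional pose distributions."""
    total = 0
    for A, omega_size in pose_size.items():
        R_A = rules.get(A, [])
        support = sum(sum(rule) for rule in R_A)
        total += 2 * omega_size * (1 + len(R_A) + support)
    return total

# Grammar 10 (Section 10.5.2): FACE -> NOSE with a 25x25 uniform window
# (support 625), and NOSE -> emptyset (one rule with no RHS symbols).
NM = 1  # edges per scene location; the full factor graph has 1258 * N * M
print(count_edges({"FACE": NM, "NOSE": NM}, {"FACE": [[625]], "NOSE": [[]]}))
# prints 1258
```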
10.3 Approximating an N-D distribution by a product of N 1-D distributions

First, recall that we use the notation [M] to indicate the set of points {0, . . . , M − 1}. Let X = {x_1, . . . , x_N} with x_i ∈ [M_i]. Let p(X) be an N-D distribution. Our goal is to approximate p(X) by a product of N one-dimensional distributions $\prod_{i=1}^N p_i(x_i)$. Note that the representation of p(X) by $\prod_{i=1}^N p_i(x_i)$ is exact only when the x_i are independent. Also, while $\|p(X)\|_0$ can be as large as $\prod_{i=1}^N M_i$, $\sum_{i=1}^N \|p_i(x_i)\|_0$ is at most $\sum_{i=1}^N M_i$, and so approximating p(X) by $\prod_{i=1}^N p_i(x_i)$ can lead to a significantly smaller total support size. To measure the quality of the approximation of p(X) by $\prod_{i=1}^N p_i(x_i)$, we use the Kullback-Leibler (KL) divergence, defined below.

Definition 43 The Kullback-Leibler (KL) divergence between discrete probability distributions r and s is defined to be

$$D_{\mathrm{KL}}(r \,\|\, s) = \sum_X r(X)\log\Big(\frac{r(X)}{s(X)}\Big) \tag{10.3}$$

where the summation is over the union of the supports of r and s.

Proposition 44 Let X = {x_1, . . . , x_N} with x_i ∈ [M_i]. Let p(X) be an N-D distribution that we seek to approximate by $\prod_{i=1}^N p_i(x_i)$. If the quality of the approximation is measured as $D_{\mathrm{KL}}(p(X) \,\|\, \prod_{i=1}^N p_i(x_i))$ and we seek to solve the optimization problem

$$p_1^*(x_1), \ldots, p_N^*(x_N) = \arg\min_{p_1(x_1),\ldots,p_N(x_N)} D_{\mathrm{KL}}\Big(p(X) \,\Big\|\, \prod_{i=1}^N p_i(x_i)\Big) \tag{10.4}$$

then

$$p_i^*(x_i) = p(x_i), \quad 1 \le i \le N. \tag{10.5}$$

That is, the optimal 1-D factors are the marginals of p(X).

Proof of Proposition 44

$$D_{\mathrm{KL}}\Big(p(X) \,\Big\|\, \prod_{i=1}^N p_i(x_i)\Big) = \sum_X p(X)\log\Big(\frac{p(X)}{\prod_{i=1}^N p_i(x_i)}\Big) \tag{10.6}$$
$$= \sum_X p(X)\log p(X) - \sum_{i=1}^N \sum_X p(X)\log(p_i(x_i)) \tag{10.7}$$
$$= \sum_X p(X)\log p(X) - \sum_{i=1}^N \sum_{x_i} p(x_i)\log(p_i(x_i)) \tag{10.8}$$

Since the p_i(x_i) are probability distributions, we have the constraint that they must sum to 1. Hence, solving Eqn. 10.4 is a constrained optimization problem. We use the method of Lagrange multipliers to enforce the constraints that $\sum_{x_i} p_i(x_i) = 1$. We formulate the Lagrange function

$$L\Big(p(X), \prod_{i=1}^N p_i(x_i)\Big) = D_{\mathrm{KL}}\Big(p(X) \,\Big\|\, \prod_{i=1}^N p_i(x_i)\Big) - \sum_{i=1}^N \lambda_i\Big(\sum_{x_i} p_i(x_i) - 1\Big). \tag{10.9}$$

Now, taking the partial derivative of the Lagrangian with respect to p_i(x_i),

$$\frac{\partial L}{\partial p_i(x_i)} = \frac{\partial D_{\mathrm{KL}}\big(p(X) \,\|\, \prod_{i=1}^N p_i(x_i)\big)}{\partial p_i(x_i)} - \lambda_i \tag{10.10}$$
$$= -\frac{p(x_i)}{p_i(x_i)} - \lambda_i. \tag{10.11}$$

Setting $\frac{\partial L}{\partial p_i(x_i)} = 0$, solving for p_i(x_i), and normalizing each p_i to sum to 1 yields

$$p_i^*(x_i) = p(x_i), \quad 1 \le i \le N. \tag{10.12}$$

Next, we give some examples of using Proposition 44 to approximate a 2-D distribution as a product of two 1-D distributions. Figure 10.1 shows examples of this approximation. Note that if the distribution p(x_1, x_2) can be factorized, then $D_{\mathrm{KL}}(p(x_1,x_2)\,\|\,p_1^*(x_1)p_2^*(x_2)) = 0$ and $p(x_1,x_2) = p_1^*(x_1)p_2^*(x_2)$. Figure 10.1(a) shows an example where p(x_1, x_2) can be factorized, and indeed, the solution given by Eqn. 10.5 is the factorization. Figures 10.1(b) and 10.1(c) show examples where p(x_1, x_2) cannot be factorized. As can be seen, the quality of the approximation can sometimes be poor, especially if there is no structure to the distribution being approximated. A more thorough analysis of the approximation produced by Eqn. 10.5 can be found in [4].
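Proposition 44 says the optimal factors are the marginals. The following minimal sketch, assuming NumPy is available (purely illustrative, not the thesis implementation), computes the product-of-marginals approximation for a randomly-generated 2-D distribution, as in Figure 10.1(c), and evaluates the KL divergence of Eqn. 10.3:

```python
import numpy as np

# Proposition 44: the best product approximation (in the KL sense used
# here) of a 2-D distribution is the product of its marginals.
rng = np.random.default_rng(0)
p = rng.random((8, 8))
p /= p.sum()                      # a randomly-generated 2-D distribution

p1 = p.sum(axis=1)                # marginal over x1
p2 = p.sum(axis=0)                # marginal over x2
q = np.outer(p1, p2)              # factorized approximation p1(x1) p2(x2)

kl = np.sum(p * np.log(p / q))    # D_KL(p || p1 p2), Eqn. 10.3
print(f"KL divergence of product approximation: {kl:.4f}")
```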
Figure 10.1: Visualization of three examples of using Proposition 44 to approximate a 2-D probability distribution by a product of two 1-D distributions. Left figures: visualization of a 2-D distribution p(x_1, x_2). Right figures: visualization of $p_1^*(x_1)p_2^*(x_2)$. Darker pixels correspond to higher probabilities. (a) Factorizing a separable 2-D Gaussian. Since the Gaussian is separable, it can be represented exactly as the product of two 1-D Gaussians. (b) Factorizing a non-separable 2-D Gaussian. Since this Gaussian is not separable, it cannot be represented exactly as the product of two 1-D Gaussians. (c) Factorizing a randomly-generated 2-D probability distribution. The distribution is not separable, so its representation as the product of two 1-D distributions is not exact.

10.3.1 Alternative approximations

To measure the disagreement between two distributions, one could use a function different from the particular KL divergence used in Proposition 44. One alternative is to reverse the direction of the KL divergence and instead use $D_{\mathrm{KL}}(\prod_{i=1}^N p_i(x_i) \,\|\, p(X))$. This direction of the KL divergence is the same as the one used in variational inference (see [63] and [4]), while the direction of the KL divergence in Proposition 44 is the same as in Expectation Propagation (see [39]).

Another alternative function to measure the disagreement between two distributions is the Frobenius norm. In the special case of a 2-D distribution, the problem of finding an optimal decomposition of a 2-D distribution in terms of two 1-D distributions is related to finding the best rank-1 approximation of a matrix in the Frobenius norm sense, which can be solved using the Singular Value Decomposition. This problem is also related to generating separable filters (see [52]).

10.4 Decomposing a 1-D Uniform distribution

Consider a Uniform distribution over the set [K]. In this section, we consider representing this distribution as a combination of Categorical distributions. Our motivation for doing so is that a Uniform distribution over the set [K] has a support size of K, but a representation of this distribution as a combination of Categorical distributions may have total support less than K. As shown in Eqn. 10.1, the number of edges in the factor graph representation of a PSG is proportional to the total support of the conditional pose distributions. By representing a Uniform distribution as a combination of Categorical distributions, the PSG factor graph may have fewer edges. As an example, suppose p(z) is a Uniform distribution over the set [100]. Then, let p_1(z_1) be a Uniform distribution over the set [10] and let p_2(z_2) be a Uniform distribution over the set {0, 10, . . . , 90}. The distribution over z_0 = z_1 + z_2 is uniform over the set [100] and the total support is 20.

Definition 45 Consider a set of Categorical distributions P = {p_i(z) | 1 ≤ i ≤ N}. Let Λ_i denote the support of p_i(z) and let Λ = {Λ_i | 1 ≤ i ≤ N}. Define

$$|\Lambda| = \sum_{i=1}^N |\Lambda_i|. \tag{10.13}$$

Intuitively, the higher |Λ| is, the more factor graph edges must be used to represent P in the PSG factor graph. In this section, given a Uniform distribution, we seek a representation of it as a set of Categorical distributions P with an associated set of supports Λ such that |Λ| is minimized over all possible choices of P. We first show how to compute the minimum value of |Λ|, and then show how one can find a P that achieves this minimum value of |Λ|.

Theorem 46 Let P = {p_i(z) | 1 ≤ i ≤ N} be a set of Categorical distributions specified so that the sum $z_0 = \sum_{i=1}^N z_i$, z_i ∼ p_i(z), is distributed uniformly on the set [K]. Let Λ = {Λ_i | 1 ≤ i ≤ N}, where Λ_i is the support of p_i(z). The minimum possible value of |Λ| is the sum of the prime factors (with repetition) of K.

Consider again the problem of representing a Uniform distribution over the set [100]. Here, K = 100, so Theorem 46 states that the minimum possible value of |Λ| is 14, since 100 = 2 × 2 × 5 × 5.
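The minimum in Theorem 46 is easy to compute. The following small sketch (illustrative only) factors K by trial division and sums the prime factors with repetition:

```python
# Minimum total support |Lambda| from Theorem 46: the sum of the prime
# factors of K, counted with repetition.
def prime_factors(k):
    out, d = [], 2
    while d * d <= k:
        while k % d == 0:
            out.append(d)
            k //= d
        d += 1
    if k > 1:
        out.append(k)
    return out

print(prime_factors(100), sum(prime_factors(100)))  # [2, 2, 5, 5] -> 14
print(prime_factors(256), sum(prime_factors(256)))  # [2, ..., 2] -> 16
```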
Before proving Theorem 46, we require some intermediate results.

Theorem 47 Let P = {p_i(z) | 1 ≤ i ≤ N} be a set of Categorical distributions and define $z_0 = \sum_{i=1}^N z_i$, z_i ∼ p_i(z). Then, p(z_0) can be expressed as p(z_0) = p_1(z_0) ∗ p_2(z_0) ∗ . . . ∗ p_N(z_0), where ∗ is the convolution operator.

Theorem 47 is a well-known result from the statistics literature.

Proposition 48 Let p_1(z) and p_2(z) be two Categorical distributions such that p(z) = p_1(z) ∗ p_2(z) is a Uniform distribution on the set [K]. Let Λ_i be the support set of p_i(z). Then, $p_1(z) = \frac{1}{|\Lambda_1|}$ ∀z ∈ Λ_1, and $p_2(z) = \frac{1}{|\Lambda_2|}$ ∀z ∈ Λ_2. In other words, p_1(z) and p_2(z) are Uniform distributions over the sets Λ_1 and Λ_2, respectively.

The proof of Proposition 48 can be found in [69].

Proposition 49 Let P = {p_i(z) | 1 ≤ i ≤ N} be a set of Categorical distributions such that p(z) = p_1(z) ∗ p_2(z) ∗ . . . ∗ p_N(z) is a Uniform distribution on the set [K]. Let Λ_i be the support of p_i(z). Then, there exists a setting of P such that $p_i(z) = \frac{1}{|\Lambda_i|}$ ∀z ∈ Λ_i, 1 ≤ i ≤ N (i.e., the p_i(z) are Uniform distributions).

Proof of Proposition 49
The result follows by mathematical induction, using Proposition 48 as the base case.

Proposition 50 Let P = {p_i(z) | 1 ≤ i ≤ N} be a set of Categorical distributions such that p(z) = p_1(z) ∗ p_2(z) ∗ . . . ∗ p_N(z) is a Uniform distribution on the set [K]. Let P be specified such that $p_i(z) = \frac{1}{|\Lambda_i|}$ ∀z ∈ Λ_i, 1 ≤ i ≤ N. Consider the quantity $z_0 = \sum_{i=1}^N z_i$, z_i ∼ p_i(z). For every possible value of z_0, there is a unique setting of the z_i that sums to z_0.

Proof of Proposition 50
Consider the smallest possible value of z_0. Denote this value by z†. There is only one setting of the z_i that sums to z†; in particular, this setting is to choose the smallest possible value for each of the z_i. Since each p_i(z) is uniform over Λ_i, $p(z^\dagger) = \prod_{i=1}^N \frac{1}{|\Lambda_i|}$.

Suppose there were a value of z_0 for which there were multiple settings of the z_i that sum to it. Denote this value by z*. Since each p_i(z) is uniform over Λ_i, each setting of the z_i that sums to z* is drawn with probability $\prod_{i=1}^N \frac{1}{|\Lambda_i|}$. Therefore, if there are m settings of the z_i that sum to z*, then $p(z^*) = m\prod_{i=1}^N \frac{1}{|\Lambda_i|}$. However, since P is specified so that p(z) is a Uniform distribution, we must have p(z†) = p(z*). This implies that m = 1, which implies that z* has a unique setting of the z_i that sums to it. We have arrived at a contradiction.

Proposition 51 Suppose we express a number K as a product of natural numbers, $K = \prod_{i=1}^N a_i$, a_i ∈ ℕ. Suppose we wish to minimize the quantity $a_0 = \sum_{i=1}^N a_i$ over all possible choices for N and all settings of the a_i. Setting the a_i to be the prime factors (with repetition) of K achieves the minimum value of a_0.

Proof of Proposition 51
Suppose there exists a number K and a setting for the a_i for which the proposition does not hold. Then, one of the a_i must not be prime. Without loss of generality, suppose a_1 is not prime. If a_1 is not prime, then it must have at least two factors, neither of which is 1. Denote these factors by c and d. Suppose we replace a_1 by cd. Let $a_0^* = c + d + \sum_{i=2}^N a_i$, which represents the effect of replacing a_1 by c and d on a_0.
Now,

$$a_0 - a_0^* = \sum_{i=1}^N a_i - \Big(c + d + \sum_{i=2}^N a_i\Big) \tag{10.14}$$
$$= a_1 - c - d \tag{10.15}$$
$$= cd - c - d \tag{10.16}$$
$$= (c-1)(d-1) - 1 \tag{10.17}$$

and so $a_0 - a_0^* \ge 0$ whenever c ≥ 2 and d ≥ 2, which implies $a_0 \ge a_0^*$ whenever c ≥ 2 and d ≥ 2. This is a contradiction, and the proposition must hold.

We are now ready to prove Theorem 46.

Proof of Theorem 46
From Theorem 47, p(z_0) can be represented as p(z_0) = p_1(z_0) ∗ p_2(z_0) ∗ . . . ∗ p_N(z_0). Let Λ_0 denote the set of possible values of z_0. From Proposition 50, each possible value of z_0 has a unique setting of the z_i that sums to it. This implies that $|\Lambda_0| = \prod_{i=1}^N |\Lambda_i|$. Since p(z_0) is distributed uniformly on the set [K], |Λ_0| = K. Therefore, we seek a setting for P such that $K = \prod_{i=1}^N |\Lambda_i|$ and $|\Lambda| = \sum_{i=1}^N |\Lambda_i|$ is minimized. From Proposition 51, the minimum is achieved when P is set such that the |Λ_i| are the prime factors (with repetition) of K, and so the minimal value of |Λ| is the sum of the prime factors (with repetition) of K.

We finish this subsection by giving a construction of P that realizes the minimum value of |Λ|.

Proposition 52 Suppose we seek a set of Categorical distributions P = {p_i(z) | 1 ≤ i ≤ N} such that p_1(z) ∗ p_2(z) ∗ . . . ∗ p_N(z) is a Uniform distribution on the set [K]. Denote the support of p_i(z) by Λ_i, and let Λ = {Λ_i | 1 ≤ i ≤ N}. Let $l_i = \prod_{j=1}^{i-1} |\Lambda_j|$. Consider the following specification of P. Set N to be the number of prime factors (with repetition) of K, set the |Λ_i| to be the prime factors (with repetition) of K, and set p_1(z) = Uniform({0, 1, . . . , |Λ_1| − 1}) and p_i(z) = Uniform({0, l_i, 2l_i, . . . , (|Λ_i| − 1)l_i}) for 2 ≤ i ≤ N. Then, p(z) = p_1(z) ∗ p_2(z) ∗ . . . ∗ p_N(z) is a Uniform distribution on the set [K], and out of all settings of P that represent a Uniform distribution on the set [K], the above construction achieves the minimum value of |Λ|.

Proof of Proposition 52
First, p(z) represents a Uniform distribution over the set [K] since each value of z has a unique representation as a sum of the possible values of the z_i, z_i ∼ p_i(z). Finally, since the |Λ_i| are set to be the prime factors of K (with repetition), by Theorem 46, the construction achieves the minimum value of |Λ|.

Theorem 46 can be trivially extended to Uniform distributions on the set {N, N + 1, . . . , N + K − 1}. To construct a set of distributions P that represents such a Uniform distribution, use the construction described in Proposition 52 except set p_1(z) = Uniform({N, N + 1, . . . , N + |Λ_1| − 1}).

We give some examples of using the construction described in Proposition 52 to represent a 1-D Uniform distribution. Figures 10.2 and 10.3 show examples of this construction. As can be seen in the figures, the constructions are exact and, according to Theorem 46, achieve the minimal value of |Λ|.

Figure 10.2: Example of using the construction described in Proposition 52 to represent a Uniform distribution over the set [100]. (a) p_i(z), 1 ≤ i ≤ 4, constructed using the process described in Proposition 52; the support sizes are 2, 2, 5, 5. (b) The probability distribution p(z) = p_1(z) ∗ p_2(z) ∗ p_3(z) ∗ p_4(z). Note that p(z) is uniform over the set [100].
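The construction of Proposition 52 is easy to verify numerically. The sketch below (illustrative only; the prime factor list is supplied by hand) builds the supports for K = 100 and checks that the resulting convolution is exactly uniform with total support 14:

```python
import itertools
from collections import Counter

# Construction of Proposition 52: given the prime factors of K (with
# repetition), build supports Lambda_i where support i uses stride
# l_i = |Lambda_1| * ... * |Lambda_{i-1}| (a mixed-radix code).
def uniform_supports(factors):
    supports, stride = [], 1
    for size in factors:
        supports.append([stride * v for v in range(size)])
        stride *= size
    return supports

supports = uniform_supports([2, 2, 5, 5])   # prime factors of K = 100
z = Counter(sum(t) for t in itertools.product(*supports))
assert sorted(z) == list(range(100)) and set(z.values()) == {1}
print("total support:", sum(len(s) for s in supports))   # 2+2+5+5 = 14
```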
Figure 10.3: Example of using the construction described in Proposition 52 to represent a Uniform distribution over the set [256]. (a) p_i(z), 1 ≤ i ≤ 8, constructed using the process described in Proposition 52; the support size of each distribution is 2. (b) The probability distribution p(z) = p_1(z) ∗ p_2(z) ∗ . . . ∗ p_8(z). Note that p(z) is exactly uniform over the set [256].

10.5 Applications to PSG design

In the PSG framework, a PSG G is associated with a factor graph F whose construction is described in Chapter 4. Suppose one wishes to specify a PSG G′ such that its associated factor graph F′ has fewer edges than F, but G′ induces a similar distribution over scenes as G. In this section, given G, we outline a construction of such a grammar G′ using the techniques developed in Sections 10.3 and 10.4. In particular, we will apply the approximation scheme described in Proposition 44 and the construction described in Proposition 52 to reduce the total size of the supports of the conditional pose distributions. Section 10.5.2 describes how the distributions over scenes induced by G and G′ are related.

10.5.1 Constructing G′

Informally, we construct a PSG G′ from G by adding symbols and by modifying and adding compositional rules. Let G = (Σ, Ω, R, q, ε, γ) represent a PSG. Let the conditional pose distribution γ_(ω,r,i), ω ∈ Ω_A(r,i), r ∈ R, 1 ≤ i ≤ n_r, be a Categorical distribution. Suppose we wish to represent γ_(ω,r,i) as a combination of distributions, as in Propositions 44 and 52. If γ_(ω,r,i) is an N-D Categorical distribution, apply the approximation described in Proposition 44 to approximate it as N 1-D Categorical distributions. If γ_(ω,r,i) is a 1-D Uniform distribution, use the construction described in Proposition 52 to represent it as a set of Categorical distributions. In both cases, γ_(ω,r,i) is replaced by a set of distributions P_(ω,r,i) = {p_(ω,r,i,j) | 1 ≤ j ≤ k_(ω,r,i)}, where k_(ω,r,i) is the number of distributions γ_(ω,r,i) is represented by. For simplicity, we assume that k_(ω,r,i) is a constant with respect to ω. For brevity, we will denote k_(ω,r,i) by k_(r,i) ∀ω ∈ Ω_A(r,i).

Below, let r = A_0 → A_1, . . . , A_n be a rule. To denote the rule r, the rule selection probability q_r, and the conditional pose distributions γ_(ω,r,i) for 1 ≤ i ≤ n_r and a given pose ω ∈ Ω_A0, we write

$$q_r,\ (A_0, \omega) \to (A_1, \gamma_{(\omega,r,1)}), \ldots, (A_n, \gamma_{(\omega,r,n)}). \tag{10.18}$$

Definition 53 Given a PSG G = (Σ, Ω, R, q, ε, γ), a rule r ∈ R, and the i-th symbol in the RHS of r, consider replacing the conditional pose distribution γ_(ω,r,i), ω ∈ Ω_A(r,i), with a set of distributions P_(ω,r,i) = {p_(ω,r,i,j) | 1 ≤ j ≤ k_(r,i)} ∀ω ∈ Ω_A(r,i). Let G′ = (Σ′, Ω′, R′, q′, ε′, γ′) be a PSG that realizes such a replacement. Below is a process for constructing G′ from G. The process is to be followed in the order given.

1. Create a set of fresh symbols B = {B_j | 1 ≤ j ≤ k_(r,i) − 1}.
2. Set Ω_b = Ω_A(r,i) ∀b ∈ B.
3. Set ε_b = 0 ∀b ∈ B.
4. Assign Σ′ = Σ ∪ B.
5. Assign Ω′ = Ω ∪ Ω_B1 ∪ . . . ∪ Ω_B(k_(r,i)−1).
6. Assign ε′ = ε ∪ ε_B1 ∪ . . . ∪ ε_B(k_(r,i)−1).
7. Assign R′ = R, q′ = q, γ′ = γ.
8. Rule r ∈ R′ has the form q_r, (A_0, ω) → (A_1, γ_(ω,r,1)), . . . , (A_i, γ_(ω,r,i)), . . . , (A_n, γ_(ω,r,n)). Replace it by the rule q_r, (A_0, ω) → (A_1, γ_(ω,r,1)), . . . , (B_1, p_(ω,r,i,1)), . . . , (A_n, γ_(ω,r,n)).
9. Add rules of the form 1.0, (B_(j−1), ω) → (B_j, p_(ω,r,i,j)), 2 ≤ j ≤ k_(r,i) − 1, to G′.
10. Add a rule of the form 1.0, (B_(k_(r,i)−1), ω) → (A_(r,i), p_(ω,r,i,k_(r,i))) to G′.

The transformation process described in Definition 53 can be repeated for as many pairs (r, i), r ∈ R, 1 ≤ i ≤ n_r, as one desires.
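The core of Definition 53 is a chaining construction: one rule with a large conditional pose distribution becomes a chain of rules, each carrying one component distribution. The sketch below illustrates only that chaining step, using a deliberately simplified, hypothetical rule representation (probability, LHS symbol, RHS symbol, distribution label); it omits poses and the bookkeeping of steps 2-7:

```python
# Simplified illustration of steps 1 and 8-10 of Definition 53: replace a
# single rule lhs -> target whose pose distribution factors into k component
# distributions by a chain lhs -> B1 -> ... -> B_{k-1} -> target.
def chain_rules(q_r, lhs, target, dists):
    k = len(dists)
    fresh = [f"B{j}" for j in range(1, k)]      # k-1 fresh symbols (step 1)
    symbols = [lhs] + fresh + [target]
    # The first rule keeps the original selection probability q_r (step 8);
    # the added chain rules fire with probability 1.0 (steps 9-10).
    probs = [q_r] + [1.0] * (k - 1)
    return [(probs[j], symbols[j], symbols[j + 1], dists[j]) for j in range(k)]

for rule in chain_rules(1.0, "FACE", "NOSE", ["p1", "p2", "p3"]):
    print(rule)   # (1.0, 'FACE', 'B1', 'p1'), (1.0, 'B1', 'B2', 'p2'), ...
```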
As we will see in the next subsection, where we give examples of using the process above, this transformation process can reduce the number of factor graph edges used to represent the PSG.

10.5.2 Examples: transformation of grammars

Consider the PSG below that models faces and noses.

Grammar 10 A grammar that models faces and noses in an N × M scene:
Σ = {FACE, NOSE}. ∀A ∈ Σ, Ω_A = [N] × [M].
Rules:
1.0, (FACE, ω) → (NOSE, UniformRect(ω − (12, 12), ω + (12, 12)))
1.0, (NOSE, ω) → ∅
ε_FACE = ε_NOSE = 10⁻⁴.

We apply the process described in Definition 53 to transform Grammar 10 into Grammar 11 below. In particular, we apply the approximation described in Proposition 44 to rule 1 and its first RHS symbol to factorize the associated 2-D conditional pose distribution.

Grammar 11 A transformation of Grammar 10 for an N × M scene using Proposition 44:
Σ = {FACE, NOSE, NOSE-Y}. ∀A ∈ Σ, Ω_A = [N] × [M].
Rules:
1.0, (FACE, ω) → (NOSE-Y, Uniform({ω + (0, −12), . . . , ω + (0, 12)}))
1.0, (NOSE-Y, ω) → (NOSE, Uniform({ω + (−12, 0), . . . , ω + (12, 0)}))
1.0, (NOSE, ω) → ∅
ε_FACE = ε_NOSE = 10⁻⁴, ε_NOSE-Y = 0.

Consider expanding a FACE brick using Grammar 11. Rules 1 and 2 model a FACE brick generating a NOSE brick. In particular, rules 1 and 2 model the sequential process of a FACE brick first choosing a Y coordinate for the NOSE brick, then choosing an X coordinate for the NOSE brick.

Finally, we apply the process described in Definition 53 to transform Grammar 11 into Grammar 12 below. In particular, we apply the construction described in Proposition 52 to the first RHS symbol of rules 1 and 2.

Grammar 12 A transformation of Grammar 11 for an N × M scene using Proposition 52:
Σ = {FACE, NOSE, NOSE-Y, NOSE-Y1, NOSE-Y2}. ∀A ∈ Σ, Ω_A = [N] × [M].
Rules:
1.0, (FACE, ω) → (NOSE-Y1, Uniform({ω + (0, −12), . . . , ω + (0, −8)}))
1.0, (NOSE-Y1, ω) → (NOSE-Y, Uniform({ω + (0, 0), ω + (0, 5), . . . , ω + (0, 20)}))
1.0, (NOSE-Y, ω) → (NOSE-Y2, Uniform({ω + (−12, 0), . . . , ω + (−8, 0)}))
1.0, (NOSE-Y2, ω) → (NOSE, Uniform({ω + (0, 0), ω + (5, 0), . . . , ω + (20, 0)}))
1.0, (NOSE, ω) → ∅
ε_FACE = ε_NOSE = 10⁻⁴, ε_NOSE-Y = ε_NOSE-Y1 = ε_NOSE-Y2 = 0.

Consider expanding a FACE brick using Grammar 12. Rules 1-4 model a FACE brick generating a NOSE brick. In particular, rules 1 and 2 model the process of choosing a Y coordinate for the NOSE brick as a two-stage process, and rules 3 and 4 model the process of choosing an X coordinate for the NOSE brick as a two-stage process.

Table 10.3 summarizes the effect of recursively applying the construction process outlined in Section 10.5.1 to Grammar 10 on the number of factor graph edges. Note that Grammar 11 has more than an order of magnitude fewer edges in its associated factor graph than Grammar 10, and Grammar 12 has even fewer edges than Grammar 11. Recall from Chapter 5 that the run time of one iteration of LBP is linear in the number of edges in the factor graph. Thus, recursively applying the techniques described in Sections 10.3 and 10.4 to Grammar 10, yielding Grammar 12, confers an approximately 21x speed-up.

Grammar | Number of edges in the factor graph
Grammar 10 | 1258NM
Grammar 11 | 112NM
Grammar 12 | 60NM

Table 10.3: Number of edges in the factor graph of the grammar models described in this section. Recall that the grammars consider an N × M scene.
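The savings in Table 10.3 come from replacing large windows by compositions of small ones. As a check, the following minimal sketch (illustrative only) verifies that the two-stage Y-offset process of Grammar 12 (rules 1 and 2) composes to the single 25-value uniform Y-offset used by Grammar 11:

```python
import itertools
from collections import Counter

# The two-stage Y-offset process of Grammar 12 composes to the single
# 25-value uniform Y-offset of Grammar 11 (Proposition 52 with 25 = 5 * 5).
stage1 = range(-12, -7)                  # Uniform({-12, ..., -8}), 5 values
stage2 = range(0, 25, 5)                 # Uniform({0, 5, ..., 20}), 5 values
composed = Counter(a + b for a, b in itertools.product(stage1, stage2))
assert sorted(composed) == list(range(-12, 13))   # offsets -12..12
assert set(composed.values()) == {1}              # each offset reached once
print("support stored: 5 + 5 = 10 instead of 25")
```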
Next, we study the distribution over scenes induced by Grammars 10, 11, and 12. In particular, we show that the distributions over scenes are not identical and exhibit a scenario in which they differ.

Figure 10.4 shows the approximate marginals computed by running LBP on the factor graph representations of Grammars 10, 11, and 12 for several examples. Note that the approximate marginals are similar across all examples for the three grammars. This suggests that in practice, any of Grammars 10, 11, or 12 can be used in place of any of the other grammars. However, the computed approximate marginals are not the same for all grammars. Let p̂0_A, p̂1_A, and p̂2_A denote the largest approximate marginals computed by LBP for a symbol A ∈ {FACE, NOSE} for Grammars 10, 11, and 12, respectively. Table 10.4 lists the ratios p̂0_A/p̂1_A and p̂0_A/p̂2_A, A ∈ {FACE, NOSE}.

Ratio | Figure 10.4(a) | Figure 10.4(b) | Figure 10.4(c) | Figure 10.4(d)
p̂0_FACE/p̂1_FACE | 1.0000 | 1.0000 | 1.0000 | 1.0000
p̂0_NOSE/p̂1_NOSE | 1.0001 | 1.0185 | 1.0000 | 1.0095
p̂0_FACE/p̂2_FACE | 1.0000 | 1.0000 | 1.0023 | 1.0000
p̂0_NOSE/p̂2_NOSE | 1.0000 | 1.0187 | 1.0001 | 1.0097

Table 10.4: p̂0_A, p̂1_A, and p̂2_A denote the largest approximate marginals computed by LBP for a symbol A ∈ {FACE, NOSE} for Grammars 10, 11, and 12, respectively. Some ratios of these approximate marginals are shown in the table. As shown, the largest approximate marginals computed by LBP are not identical across the grammars, suggesting that each of the grammar models induces a different distribution over scenes.

The ratios in Table 10.4 show that the approximate marginals produced by LBP are not identical, suggesting that Grammars 10, 11, and 12 each induce a slightly different distribution over scenes. In general, a PSG G and a transformation G′ of it constructed using the process described in Section 10.5.1 do not induce the same distribution over scenes. We demonstrate this fact below for Grammars 10, 11, and 12.

Consider two bricks (FACE, ω) and (FACE, ω + (0, 1)). Suppose these bricks are present in a scene, and consider the probability of the event that a sequence of expansions for both bricks leads to the generation of a single NOSE brick (i.e., their expansions "collide" and only one NOSE brick is generated instead of two). For Grammar 10, the probability of this event is $\frac{600}{625} \times \frac{1}{625} = \frac{600}{625^2}$. For Grammars 11 and 12, two FACE bricks will generate the same NOSE brick if and only if they generate the same NOSE-Y brick. So, the probability of the event is $\frac{24}{25} \times \frac{1}{25} = \frac{24}{25^2}$ for Grammar 11, and $\frac{4}{5} \times \frac{1}{5} = \frac{4}{25}$ for Grammar 12. Thus, the probability of the event that the sequence of expansions for both bricks leads to the generation of a single NOSE brick is different for Grammars 10, 11, and 12.

In general, the probability that a sequence of expansions for two bricks b_1 and b_2 leads to the generation of a common brick (i.e., their expansions "collide") is a function of 1) the number of bricks b_1 can generate, 2) the number of bricks b_2 can generate, and 3) the number of common bricks b_1 and b_2 can generate. As such, any PSG transformation that changes these three quantities may change the distribution over scenes induced by the PSG.
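The overlap counts behind these collision probabilities can be checked directly. The sketch below (a hypothetical helper, not the thesis implementation) computes the one-step collision probability for two parents offset by (0, 1), each choosing a child pose uniformly over a given offset window:

```python
from fractions import Fraction
import itertools

def collision_prob(offsets):
    # Both parents choose a child pose uniformly from `offsets` around
    # themselves; a collision means they select the same absolute pose.
    a = {(dx, dy) for dx, dy in offsets}          # parent at the origin
    b = {(dx, 1 + dy) for dx, dy in offsets}      # parent offset by (0, 1)
    overlap = len(a & b)
    return Fraction(overlap, len(offsets)) * Fraction(1, len(offsets))

window25 = [(dx, dy) for dx in range(-12, 13) for dy in range(-12, 13)]
print(collision_prob(window25))                            # 600/625^2 (Grammar 10)
print(collision_prob([(0, dy) for dy in range(-12, 13)]))  # 24/25^2 (Grammar 11)
print(collision_prob([(0, dy) for dy in range(-12, -7)]))  # 4/25 (Grammar 12)
```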
Figure 10.4: A visualization of the approximate marginals computed by LBP when conditioning on sets of bricks. Each subfigure (a)-(d) represents a different example, and the top, middle, and bottom rows of each subfigure show the results of running LBP on the factor graph representations of Grammars 10, 11, and 12, respectively. We show only the approximate marginals for the FACE and NOSE bricks. Each pixel represents a brick at that pixel location. The bricks conditioned to be present in the image are denoted by a red pixel and a red arrow pointing to them. Darker pixels indicate a higher approximate marginal probability of being present. Note that for all examples, the approximate marginals produced by LBP are similar across the different grammars.

10.6 Notes on 1-D Uniform distributions with a prime support size

Given Theorem 46, one may be concerned that it is not possible to represent a 1-D Uniform distribution p(z) with a prime support size in a manner that will result in a reduction in the number of edges of the PSG factor graph. For example, consider a Uniform distribution over [101]; Theorem 46 indicates that expressing this distribution as a convolution of a set of 1-D Categorical distributions results in no computational savings since 101 is a prime number. To deal with this scenario, one can partition the support of p(z) into L contiguous partitions, express a Categorical distribution over these partitions with each partition being chosen with probability proportional to its size, and apply the decomposition in Proposition 52 to represent a Uniform distribution over the elements of each of the L partitions.

The process to sample from p(z) with support of size K, given a partitioning of the support, proceeds as follows: first, choose partition i with probability $\frac{|\Lambda_i|}{K}$, where |Λ_i| is the size of partition i. Then, sample an element from the selected partition uniformly using the representation described in Proposition 52. The total support of this representation of p(z) is L plus the total support of the decomposed Uniform distributions over the L partitions (i.e., by Theorem 46, L plus the sum of the prime factors, with repetition, of each |Λ_i|).

As an example, let p(z) be a Uniform distribution over the set {0, . . . , 100}. One can partition the support into two sets, Λ_1 = {0, . . . , 50} and Λ_2 = {51, . . . , 100}. A Uniform distribution over the elements of each of these partitions can be represented using the decomposition in Proposition 52. To sample from p(z), first choose either Λ_1 or Λ_2 with probability 51/101 and 50/101, respectively. Then, sample an element from the selected partition uniformly using the representation described in Proposition 52. Since 51 = 3 × 17 and 50 = 2 × 5 × 5, the total support of representing a Uniform distribution over the set {0, . . . , 100} in this fashion is 2 + 20 + 12 = 34. This technique can be applied even if the support size of p(z) is not prime. The solution to finding an optimal partitioning of the support of p(z), minimizing the total support of the representation scheme given here, is currently unknown and is a direction for future research.
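As a check on the arithmetic of this example, the following small sketch computes the total support of the partitioned representation (the helper sum_prime_factors is illustrative):

```python
# Total support of the partitioned representation of Uniform({0, ..., 100}):
# L (for the partition-choosing distribution) plus, per partition, the sum
# of the prime factors of its size (Theorem 46).
def sum_prime_factors(k):
    total, d = 0, 2
    while d * d <= k:
        while k % d == 0:
            total += d
            k //= d
        d += 1
    return total + (k if k > 1 else 0)

sizes = [51, 50]                 # |Lambda_1| = 51, |Lambda_2| = 50
L = len(sizes)
total_support = L + sum(sum_prime_factors(s) for s in sizes)
print(total_support)             # 2 + 20 + 12 = 34, versus 101 directly
```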
10.7 Notes on general 1-D Categorical distributions

The techniques to decompose a 1-D distribution outlined in Section 10.4 only apply to 1-D Uniform distributions. In the case of a general 1-D Categorical distribution, one cannot use the aforementioned techniques. The general problem of decomposing a 1-D Categorical distribution into a set of 1-D Categorical distributions with smaller total support is related to the problem of blind deconvolution (see [37]) with a prior that encourages sparsity in the composing distributions. This general problem is more difficult than the special case considered in this thesis, where the 1-D distribution is Uniform. The solution to this more general problem is a goal of future research.

In this work, the task of decomposing a 1-D Categorical distribution into a set of 1-D Categorical distributions with smaller total support is motivated by a desire to define a PSG that gives rise to a factor graph with a smaller number of edges. If one wishes to use a particular 1-D Categorical distribution to define a PSG, presumably that distribution was estimated from training data. Instead of decomposing a given Categorical distribution directly, one can first define a family of 1-D Categorical distributions parameterized by the parameters of N 1-D Categorical distributions with fixed support. Then, one can express this decomposition directly in a PSG, as was done in Section 10.5.2. Finally, one can fit the parameters of the N 1-D Categorical distributions with the learning algorithm defined in Chapter 8 using the training data. This process results in a PSG with a smaller number of factor graph edges and an approximation to the target 1-D Categorical distribution in terms of a convolution of N 1-D Categorical distributions. Providing a detailed analysis of the goodness of the resulting approximation is a goal of future research.

Chapter 11

Contributions and Suggestions for Future Research

To conclude this thesis, we summarize the approach of the PSG framework, outline the research contributions, and suggest directions of future research that build on the PSG framework.

11.1 Summary of approach and research contributions

In this thesis, we have introduced the Probabilistic Scene Grammar (PSG) framework: a general-purpose probabilistic framework for scene understanding tasks. For a given scene understanding task, we summarize the approach of the PSG framework below:

1. Represent a model for the given scene understanding task in the language of a grammar; Chapter 2 defines the grammar specification, and Chapter 3 gives examples of grammars for some scene understanding tasks.
2. Convert the grammar representation to a factor graph, as described in Chapter 4.
3. Directly estimate model parameters if fully-observed data is available. Otherwise, use the approximate EM learning algorithm described in Chapter 8 to estimate model parameters.
4. Use Loopy Belief Propagation (LBP) as the inference engine, as described in Chapter 5.

Importantly, the approach of the PSG framework is the same no matter the scene understanding task at hand. In theory, any scene understanding task for which a suitable model can be expressed in the grammar language defined in Chapter 2 may be addressed in the PSG framework. However, practical limitations of the PSG framework may prevent its application to arbitrary scene understanding tasks with arbitrary grammar models. Chapter 10 takes some steps towards addressing practical issues that must be resolved to enable the use of larger PSG models on more complex scene understanding tasks. Chapter 9 evaluates the PSG framework on the scene understanding tasks of contour detection, face localization, and binary image segmentation. The PSG framework is competitive with specialized algorithms for these scene understanding tasks.

The main contributions of this thesis can be summarized as addressing four key aspects of defining and assessing a general-purpose probabilistic framework for scene understanding: 1) the representation of scene understanding tasks under a common schema (Chapters 2, 3, and 4), 2) efficient, problem-agnostic approximate inference (Chapters 5, 6, and 10), 3) the learning of model parameters under varying levels of supervision (Chapter 8), and 4) the experimental evaluation of the framework (Chapter 9). The concretization of this framework in a general implementation is a final, engineering-oriented contribution.
11.2 Directions for future research

In this section, we discuss promising directions for future research that build on the PSG framework. The proposed directions deal with issues concerning 1) the use of richer data models (e.g., data models defined by deep learning models), 2) the application of the PSG framework to more scene understanding tasks, and 3) the practical limitations of the PSG framework in its current form.

11.2.1 Integration of deep learning models

In recent years, the approach of deep learning has demonstrated impressive results on scene understanding tasks (see [50, 66, 36, 32, 46, 21] for a few examples). In a sense, deep learning is also a general-purpose scene understanding framework: it seeks to learn a mapping between inputs (e.g., images) and outputs (e.g., class labels, segmentations, object localizations, etc.). It is crucial to note that deep learning does not have to be viewed as a competitor to probabilistic models; both can be used together in a coherent system. The emerging subfield of Bayesian Deep Learning (see [64] for a brief survey) seeks to combine probabilistic approaches with deep learning techniques.

In the context of the PSG framework, one could imagine using the output of a convolutional neural network (CNN) such as the one described in [32] as a data model. We believe such an approach could combine the ability of deep learning to produce excellent low-level representations with the high-level reasoning ability of the PSG framework. Such a combination could allow one to tackle scene understanding tasks that are currently difficult for both approaches. Consider, for example, the problem of detecting conversations in scenes when one has many examples of faces but few examples of conversations. The notion of a conversation is naturally modeled as a compositional relationship: a conversation can be thought of as a composition of faces that are facing each other and in close spatial proximity. Such compositional models can be naturally expressed in the PSG framework. Deep learning is capable of building a high-quality representation of objects when one has many examples of that object. In this example, deep learning could be used to build an excellent face detector, but perhaps cannot be used to build a conversation detector. One could use deep learning to detect faces, and the PSG framework to detect conversations using the face detections as an input.

11.2.2 Applications to more scene understanding tasks

In this thesis, we evaluate the PSG framework on the scene understanding tasks of face localization, contour detection, and binary image segmentation. As the PSG framework is general-purpose, there is a myriad of other scene understanding tasks that can be addressed in this framework. For example, the PSG framework can be used for motion tracking. The concept of motion can be naturally expressed in the PSG framework. Consider the problem of tracking a face through time. Recall that the face models we describe in Chapter 3 describe the location of faces and face parts in terms of spatial location. One could extend the model to include both spatial and temporal information. For example, a FACE brick at location (x, y) at time t could generate a FACE brick at location (x, y) at time t + 1.

The PSG framework could also be applied to larger and more complex scene understanding tasks. Consider the problem of localizing tumours from magnetic resonance imaging (MRI) brain scans.
MRI brain scans are volumetric 3-D scans of a patient's brain. These scans can be fairly large; a typical scan may contain 200 × 200 × 144 measurements. Given a brain scan, one seeks to output a 3-D segmentation of the brain that localizes tumours in the brain, if any. Here, the image data is the MRI scan, which has two orders of magnitude more measurements than any of the scene understanding tasks we address in this thesis, and so speed/memory issues may arise. Also, the shape of a tumour can be quite complex, necessitating a more sophisticated notion of shape than the one used in Chapter 9 for leaf segmentation. Nevertheless, we believe tackling such large, complex scene understanding tasks is in the realm of possibility.

11.2.3 Structure learning

Recall from Chapter 8 that in this thesis, we propose a method to estimate the model parameters of a PSG. However, we have not proposed a method to learn the structure of the grammar itself. For example, the PSG we use for face localization described in Chapter 9 has a notion that a FACE is comprised of a LEFT-EYE, RIGHT-EYE, NOSE, and a MOUTH. What if we did not know a priori that faces had this compositional structure? What if we did not know the parts of a face? In the context of the PSG framework, addressing such questions requires learning the compositional rules of the model and learning an appropriate set of symbols. The study of learning structure in compositional models has been examined in [24] and [54]. However, it may not be practical to directly apply the techniques of [24] and [54] to the PSG framework. The problem of efficiently learning the structure of a PSG model is difficult, and it is a key question one must solve before applying the PSG framework to scene understanding tasks where one cannot rely on expert knowledge to design the structure of the grammar.

Bibliography

[1] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, May 2011.

[2] Julian Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B, 48(3):259–302, 1986.

[3] Elie Bienenstock, Stuart Geman, and Daniel Potter. Compositionality, MDL priors, and object recognition. In Advances in Neural Information Processing Systems, pages 838–844, 1997.

[4] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[5] Yuri Boykov and Gareth Funka-Lea. Graph cuts and efficient N-D image segmentation. Int. J. Comput. Vision, 70(2):109–131, November 2006.

[6] Yuri Y. Boykov and Marie-Pierre Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In 8th IEEE International Conference on Computer Vision, volume 1, pages 105–112. IEEE Computer Society, 2001.

[7] J. Canny. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell., 8(6):679–698, June 1986.

[8] Gregory F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2):393–405, 1990.

[9] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 886–893, 2005.

[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.

[12] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.

[13] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61(1):55–79, 2005.

[14] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient belief propagation for early vision. Int. J. Comput. Vision, 70(1):41–54, October 2006.

[15] Pedro F. Felzenszwalb and David McAllester. Object detection grammars. University of Chicago Computer Science Technical Report 2010-02, 2010.

[16] Pedro F. Felzenszwalb and John G. Oberlin. Multiscale fields of patterns. In Advances in Neural Information Processing Systems, pages 82–90, 2014.

[17] Sanja Fidler, Marko Boben, and Aleš Leonardis. Learning a hierarchical compositional shape vocabulary for multi-class object representation. arXiv:1408.5516, 2014.

[18] Martin A. Fischler and Robert A. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, (1):67–92, 1973.

[19] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.

[20] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741, November 1984.

[21] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[22] Ross B. Girshick, Pedro F. Felzenszwalb, and David McAllester. Object detection with grammar models. In Advances in Neural Information Processing Systems, pages 442–450, 2011.

[23] Kristen Grauman and Trevor Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In Proceedings of the Tenth IEEE International Conference on Computer Vision - Volume 2, ICCV '05, pages 1458–1465, Washington, DC, USA, 2005. IEEE Computer Society.

[24] Matthew T. Harrison. Discovering compositional structures. Technical report, 2005.

[25] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, April 1970.

[26] Tom Heskes, Onno Zoeter, and Wim Wiegerinck. Approximate expectation maximization. In S. Thrun, L. K. Saul, and P. B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 353–360. MIT Press, 2004.

[27] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

[28] Ya Jin and Stuart Geman. Context and hierarchy in a probabilistic image model. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 2145–2152, 2006.

[29] Zoltan Kato and Ting-Chuen Pong. A Markov random field image segmentation model for color textured images.
Image and Vision Computing, 24(10):1103–1114, 2006.

[30] Vladimir Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Trans. Pattern Anal. Mach. Intell., 28(10):1568–1583, October 2006.

[31] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 109–117. Curran Associates, Inc., 2011.

[32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[33] Frank R. Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

[34] Tejas D. Kulkarni, Pushmeet Kohli, Joshua B. Tenenbaum, and Vikash K. Mansinghka. Picture: A probabilistic programming language for scene perception. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4390–4399, 2015.

[35] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, CVPR '06, pages 2169–2178, Washington, DC, USA, 2006. IEEE Computer Society.

[36] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

[37] Anat Levin, Yair Weiss, Fredo Durand, and William T. Freeman. Understanding blind deconvolution algorithms. IEEE Trans. Pattern Anal. Mach. Intell., 33(12):2354–2367, December 2011.

[38] Talya Meltzer, Amir Globerson, and Yair Weiss. Convergent message passing algorithms: A unifying view. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, pages 393–401, Arlington, Virginia, United States, 2009. AUAI Press.

[39] Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI '01, pages 362–369, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[40] David Mumford. Optimal approximation by piecewise smooth functions and associated variational problems. Communications on Pure and Applied Mathematics, pages 577–685, 1989.

[41] David Mumford. Elastica and computer vision. In Algebraic Geometry and its Applications, pages 491–506. Springer, 1994.

[42] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence, pages 467–475, 1999.

[43] Radford Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.

[44] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.

[45] Stephen E. Palmer. Vision Science: Photons to Phenomenology, volume 1. MIT Press, 1999.

[46] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015.

[47] Zhile Ren and Erik B. Sudderth. Three-dimensional object detection and layout prediction using clouds of oriented gradients.
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1525–1533, 2016.

[48] Daniel Ritchie. Probabilistic Programming for Procedural Modeling and Design. PhD thesis, Stanford University, 2016.

[49] Florian Schroff, Antonio Criminisi, and Andrew Zisserman. Object class segmentation using random forests. In Proc. British Machine Vision Conference (BMVC), January 2008.

[50] Wei Shen, Xinggang Wang, Yan Wang, Xiang Bai, and Zhijiang Zhang. DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3982–3991, 2015.

[51] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, August 2000.

[52] Douglas Shy and Pietro Perona. X-Y separable pyramid steerable scalable kernels. In CVPR, pages 237–244, 1994.

[53] Oskar Söderkvist. Computer vision classification of leaves from Swedish trees. Master's thesis, 2001.

[54] Andreas Stolcke. Bayesian learning of probabilistic language models. Technical report, 1994.

[55] Thomas M. Strat. Employing contextual information in computer vision. In Proceedings of ARPA Image Understanding Workshop, pages 217–229, 1993.

[56] Erik B. Sudderth, Alexander T. Ihler, Michael Isard, William T. Freeman, and Alan S. Willsky. Nonparametric belief propagation. Commun. ACM, 53(10):95–103, 2010.

[57] Deqing Sun, Jonas Wulff, Erik B. Sudderth, Hanspeter Pfister, and Michael J. Black. A fully-connected layered model of foreground and background flow. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 2451–2458, 2013.

[58] Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.

[59] Zhuowen Tu, Xiangrong Chen, Alan L. Yuille, and Song-Chun Zhu. Image parsing: Unifying segmentation, detection, and recognition. International Journal of Computer Vision, 63(2):113–140, 2005.

[60] Luminita A. Vese and Tony F. Chan. A multiphase level set framework for image segmentation using the Mumford and Shah model. International Journal of Computer Vision, 50(3):271–293, 2002.

[61] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition, 2015.

[62] Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, and Antonio Torralba. HOGgles: Visualizing object detection features. In Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV '13, pages 1–8, Washington, DC, USA, 2013. IEEE Computer Society.

[63] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1–305, January 2008.

[64] Hao Wang and Dit-Yan Yeung. Towards Bayesian deep learning: A framework and some existing methods. IEEE Trans. on Knowl. and Data Eng., 28(12):3395–3408, December 2016.

[65] Yair Weiss. Comparing the mean field method and belief propagation for approximate inference in MRFs, 2001.

[66] Jimei Yang, Brian L. Price, Scott Cohen, Honglak Lee, and Ming-Hsuan Yang. Object contour detection with a fully convolutional encoder-decoder network. CoRR, abs/1603.04530, 2016.

[67] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss.
Constructing free energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282–2312, 2005.

[68] Yibiao Zhao and Song-Chun Zhu. Image parsing with stochastic scene grammar. In Advances in Neural Information Processing Systems, pages 73–81, 2011.

[69] Anatoly Zhigljavsky, Nina Golyandina, and Svyatoslav Gryaznov. Deconvolution of a discrete uniform distribution. Statistics and Probability Letters, 118:37–44, 2016.

[70] Song-Chun Zhu and David Mumford. A stochastic grammar of images. Found. Trends Comput. Graph. Vis., 2(4):259–362, January 2006.