Abstract of “Reliable and scalable variational inference for nonparametric mixtures, topics, and sequences” by Michael C. Hughes, Ph.D., Brown University, May 2016.

We develop new algorithms for training nonparametric clustering models based on the Dirichlet Process (DP), including DP mixture models, hierarchical Dirichlet process (HDP) topic models, and HDP hidden Markov models. This family of models has been widely used because Bayesian nonparametric posterior distributions allow coherent comparison of different numbers of clusters for a given fixed dataset. The nonparametric approach is particularly promising for large-scale or streaming applications, where other model selection techniques like cross-validation are far too expensive. However, existing training algorithms fail to live up to this promise. Both Monte Carlo samplers and variational optimization methods are vulnerable to local optima and sensitive to initialization, especially the initial number of clusters. Our new algorithms can reliably escape local optima and poor initializations to discover interpretable clusters from millions of training examples.

For the basic DP mixture model, we pose a variational optimization problem in which the number of instantiated clusters assigned to data can be adapted during training. The focus of this optimization is an objective function that tightly lower bounds the marginal likelihood and thus can be used for Bayesian model selection. Our algorithm maximizes this objective score via block coordinate ascent interleaved with proposal moves that can add useful clusters to escape local optima while removing redundant or irrelevant clusters. We further introduce an incremental algorithm that can exactly optimize our objective function on large datasets while processing only small batches at each step. Our approach uses cached or memoized sufficient statistics to make exact decisions for proposal acceptance or rejection.
This memoized approach has the same runtime cost as previous stochastic methods but allows exact acceptance decisions for cluster proposals and avoids learning rates entirely.

We later extend these algorithms to HDP topic models and HDP hidden Markov models. Previous methods for the HDP have used zero-variance point estimates with problematic model selection properties. Instead, we find sophisticated solutions to the non-conjugacy inherent in the HDP that still yield an optimization objective function usable for Bayesian model selection. We demonstrate promising proposal moves for adapting the number of clusters during memoized training on millions of news articles, hundreds of motion capture sequences, and the human genome.

Finally, we show that introducing an additional sparsity constraint to the variational optimization problem for local cluster assignments leads to speed gains without sacrificing model quality. Looking forward, we anticipate possible memoized variational algorithms with adaptive proposal moves for a broad family of Bayesian nonparametric clustering models and suggest potential theoretical guarantees on approximation quality based on data-driven initializations of proposals.

Reliable and scalable variational inference for nonparametric mixtures, topics, and sequences
by Michael C. Hughes
B.S., Franklin W. Olin College of Engineering, 2010
Sc.M., Brown University, 2012

A dissertation submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in the Department of Computer Science at Brown University

Providence, Rhode Island
May 2016

© Copyright 2016 by Michael C. Hughes

This dissertation by Michael C. Hughes is accepted in its present form by the Department of Computer Science as satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date    Erik B. Sudderth, Director

Recommended to the Graduate Council

Date    Benjamin Raphael, Reader, Dept. of Computer Science, Brown University
Date    Emily B. Fox, Reader, Dept. of Statistics, University of Washington

Approved by the Graduate Council

Date    Peter W. Weber, Dean of the Graduate School

Vitae

Michael C. Hughes was born on August 2, 1987, in Englewood, Colorado, USA. His family moved to Billings, Montana in 1994 and he received his entire primary and secondary education from Billings public schools, graduating from Billings Senior High in 2006. He attended Franklin W. Olin College of Engineering in Needham, Massachusetts, graduating in 2010 with a B.S. in Electrical and Computer Engineering. He immediately enrolled in graduate studies at Brown University’s Department of Computer Science. He earned his Master’s degree from Brown in 2012 and his Ph.D. in 2016. He gratefully acknowledges funding support from the National Science Foundation Graduate Research Fellowship.

Publications related to this thesis

Michael C. Hughes, William Stephenson, and Erik B. Sudderth. Scalable Adaptation of State Complexity for Nonparametric Hidden Markov Models. Neural Information Processing Systems (NIPS), 2015.

Michael C. Hughes, Dae Il Kim, and Erik B. Sudderth. Reliable and Scalable Variational Inference for the Hierarchical Dirichlet Process. Artificial Intelligence & Statistics (AISTATS), 2015.

Michael C. Hughes and Erik B. Sudderth. Memoized Online Variational Inference for Dirichlet Process Mixture Models. Neural Information Processing Systems (NIPS), 2013.

Other relevant publications

Emily B. Fox, Michael C. Hughes, Erik B. Sudderth, and Michael I. Jordan. Joint Modeling of Multiple Time Series via the Beta Process with Application to Motion Capture Segmentation. Annals of Applied Statistics, Vol. 8(3), 2014.

Michael C. Hughes, Emily Fox, and Erik B. Sudderth. Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data. Neural Information Processing Systems (NIPS), 2012.

Dedicated to my parents

For my mother, Lorie Hughes, my first teacher.
She taught me how to read and write, but also to pursue what is right, even if it means staying long after everyone else leaves.

For my father, Gary Hughes, who was the first scientist I ever met and still the one I look up to most. He taught me to ask questions and work hard, but also not to take myself too seriously. Hook, you’re a codfish.

Acknowledgements

First and foremost, my thanks go to Erik Sudderth, who was crazy enough to take me on as an advisee from a small school no one had ever heard of and without a lick of machine learning experience. I could not even pronounce the word “Dirichlet” the first time I walked in his door. As a mentor, he imparted to me the importance of pursuing elegant models and rigorous experiments. I am grateful for his seemingly endless patience in guiding me whenever I floundered along the way. Such patience is a rare quality in advisers with his level of brilliance. Finally, I am thankful for the work-life balance that he modeled during my graduate career. People always say that you will become more of an expert than your adviser on your thesis topic, but I only hope I can be half the mentor he is someday.

Thanks go also to my readers and faculty mentors. First, thanks to Emily Fox, who offered crucial guidance during my early years of graduate school as well as during my final year job search. It is very rare these days – and much appreciated – to find a co-author who will print out a manuscript and mark it up by hand. Next, thanks to Ben Raphael for suggesting the chromatin segmentation problem and for many productive discussions at research meetings and in his seminar course. Along with all the faculty I encountered at Brown, I’d like to give a special word of thanks to James Hays, Stefanie Tellex, Michael Littman, Eugene Charniak, and Michael Black for discussions and coursework that shaped how I approach research problems.
Finally, thanks to Tom Ouyang and Marc Stogaitis for a productive summer away at Google back in 2013.

I chose Brown for graduate school because I thought it had an extremely supportive community of fellow students, and I was not disappointed. My thanks go to so many: to Soumya Ghosh who convinced me that Erik’s group was the real deal and who was a fellow Broncos fan deep in the heart of the empire; to Dae Il Kim for his contagious enthusiasm for topic models and for always knowing where the party was at, from Tahoe to Edinburgh; to Jason Pacheco who was always the first person to ask about anything in machine learning and a genuine friend; to Layla Oesper who kept me sane with homemade calzones and who knows when to say “Calamity Jane”; to Ryan Cabeen for many discussions about research and many hours of frisbee, Jurassic Park, and beer; to Silvia Zuffi who was the most supportive officemate I could ask for; to Anna Ritz who helped me make important life decisions over coffee and brunch; to Zhile Ren, who always helped me think about reaching a bigger audience; to Ben Swanson whose zany banjo antics and graphical model graffiti convinced me to come to Brown in the first place; to Deqing Sun for showing the way in vision reading group, especially in how to give engaging talks; to Betsy Hilliard, who became a staunch friend and confidant despite the soda explosion; to Scott Wylie, who led countless relaxing diversions involving disc golf, rope swings, and lady ducks; and to Jesse Butterfield, who taught me to bet it all on black and introduced me to important Providence landmarks like the GCB and Hot Club. Finally, thanks to Andy and Kristin Loomis, Eric Sodomka, Joe Politz, Ben Lerner, Mike Bryant, Alex Gillmor, Seth Goldenburg, Connor Gramazio and the rest of INEB reading group for helping me de-stress every Wednesday afternoon.

Thanks also to the crew of students I had the pleasure to mentor the past few years.
To Geng Ji, who derived variational algorithms for more models in his first two years than I have in my whole career; to Will Stephenson, who proved the HDP surrogate bound and was the first guinea pig for my Python code; to Sonia Phene, who helped me figure out parallelization; to Leah Weiner, who can do great things with topic models; to Gabe Hope, who bravely combined my code with Gaussian processes; to Jake Soloff, who mastered mixture-of-Gaussian observation models; and to Alexis Cook, who will carefully rederive any equation put in front of her. Many others deserve credit too, including Jincheng Li, Mengrui Ni, Oussama Fadil, and Mert Terzihan. The BNPy project would not be where it is today without their help debugging my code and keeping my crazy ideas in check.

Even before I made it to graduate school, I had many important teachers who taught me even more important lessons. First, from my pre-college years, I give thanks to K. Stuart Smith, who introduced me to computer science and taught me that good food and good conversation are the best way to spend an evening; to Jacquie MacDonald, who taught calculus in public high school better than anyone; to Heather Oberdeck, who helped me realize how to be an adult without losing the best things about being a kid; to Ms. Linda Horst, who showed me in 8th grade just how far mathematics could take me; and to Ms. Robertson, who convinced me that joining math club in 7th grade was not the end of my social life. Boy, was she right. Next, from my undergraduate days at Olin College, I say thanks to Matthew Jadud, who taught me how to tackle research problems on my own; to Lynn Andrea Stein, whose keen advice about graduate school helped me choose Brown in the first place; to John Geddes, who teaches writing even better than he teaches mathematics; and to Allen Downey, whose books continue to inspire me.

I am grateful to many personal friends for their support during the last few years and throughout my life.
There are too many to name, but I will try my best. To Kyle Bjordahl, who has been a partner-in-crime, business co-founder, best man, and brother since first grade; to Kayton Parekh and Matt Helgeson, who know what to do when a chicken is on deck and Dutch gold is on the line; to Marc Sweetgall, who still holds the record for most nights spent on my couch; to Mike Roenbeck, who knows the key to graduate school stress is a crunchwrap; to Jessi Murray, who could never resist a good pie fight; to Brian Fahrenbach, who is a fellow lifelong member of the local 316; to Pam Heidt, who was always better than me at dodgeball; to Amanda Pratt, who put the electrical in my ECE degree and was there to watch the sun rise on Half Dome; to Casey Canfield, who is always there when you need her; to Katie Miller and George Sass, who have never been anything but generous; to Ben and Karen Salinas, who do the best karaoke in town; and to Sarah and Stefan Wolpert, who know how to pack a Yeti 110 for a float down the river.

Thanks to my parents, Lorie and Gary Hughes, to whom I dedicate this thesis. Growing up, I always wanted to go to graduate school because they made it sound so much fun. In this and so many other things, they were right. For my older sister, Sarah, who was always helpful crafting posters for my childhood science projects and who also made sure I didn’t turn out to be a complete nerd. For my younger sister, Amy, who has always been there to make me laugh and helped me navigate the grind of graduate school. You’re killing me, smalls.

Next, many thanks to my newest family members: my father-in-law Kailash Mutha, who has been a constant source of wisdom and advice; to Sarita, who taught me the best way to celebrate is with a limousine; and to Sulochana, who was there for many hikes and who knows the value of a good nap. Finally, my deepest thanks to my wife and best friend, Heena.
She was there at the start to encourage me to apply to graduate school, there in the middle of it to keep me well-rounded and well-fed, and here at the end to help me celebrate. No one has my back like she does. We’ve had many adventures during these last six years: climbing mountains, rafting rivers, chasing sloths, and falling into sewers. Thanks for all the love and laughter along the way. My future is yours.

Contents

List of Figures
List of Algorithms

1 Introduction
  1.1 Probabilistic clustering models
    1.1.1 Mixture models
    1.1.2 Topic models
    1.1.3 Hidden Markov models
  1.2 Bayesian nonparametric models for model selection
  1.3 Existing algorithms: challenges and opportunities
    1.3.1 Variational optimization algorithms
    1.3.2 MCMC sampling algorithms
  1.4 Outline of Contributions
    1.4.1 Open-source software

2 Background
  2.1 Hierarchical directed graphical models for clustering
    2.1.1 Global vs. local random variables
    2.1.2 Allocation model: Generating cluster assignments
    2.1.3 Observation model: Generating data from assigned clusters
  2.2 Distributions from the exponential family
    2.2.1 Examples of exponential family likelihoods
    2.2.2 Properties of the cumulant function
    2.2.3 Mean parameterization and Bregman divergences
    2.2.4 Conjugate priors
  2.3 Learning observation model parameters from data
    2.3.1 Sufficient statistics
    2.3.2 Maximum likelihood (ML) point estimation
    2.3.3 Maximum a posteriori (MAP) point estimation
    2.3.4 Posterior estimation
  2.4 Point estimation algorithms for finite mixture models
    2.4.1 Bregman k-means point estimation for finite mixture model
    2.4.2 Bregman k-means++ initialization of point estimates for global parameters
  2.5 Posterior inference via variational optimization
    2.5.1 Mean-field approximate posterior density estimation
    2.5.2 Evidence lower-bound objective function
    2.5.3 Observation model term of the objective
    2.5.4 Allocation model term of the objective
  2.6 Variational inference algorithm for the finite mixture model
    2.6.1 Local parameter update step
    2.6.2 Global parameter update step
    2.6.3 Full-dataset optimization algorithm
  2.7 Discussion
3 Scalable inference for DP mixture models
  3.1 Stick-breaking construction of Dirichlet Process
    3.1.1 Dirichlet processes
    3.1.2 Stick-breaking transformation
  3.2 Dirichlet process mixture models
    3.2.1 Generative model for global parameters
    3.2.2 Generative model for local variables
  3.3 Posterior inference as a variational optimization problem
    3.3.1 Mean-field approximate posterior
    3.3.2 Evidence lower-bound objective function
    3.3.3 Observation model term of the objective
    3.3.4 Allocation model term of the objective
  3.4 Update steps for variational optimization
    3.4.1 Global parameter update step
    3.4.2 Local parameter update step
    3.4.3 Nested truncation w.r.t. number of active clusters
  3.5 Algorithms for variational optimization
    3.5.1 Full-dataset block coordinate ascent algorithm
    3.5.2 Stochastic variational inference
    3.5.3 Memoized variational inference
    3.5.4 Initialization of global parameters
  3.6 Experimental results
    3.6.1 Clustering image patches with zero-mean Gaussian likelihoods
    3.6.2 Clustering documents with multinomial likelihoods
  3.7 Discussion

4 Proposal moves to escape local optima
  4.1 Merge moves for DP mixture models
    4.1.1 Merge proposal construction
    4.1.2 Evaluation of merge proposals
    4.1.3 Scalable construction and evaluation via memoized statistics
    4.1.4 Selecting a pair of clusters to try merging
  4.2 Birth moves
    4.2.1 Birth proposal construction
    4.2.2 Birth proposal evaluation
    4.2.3 Construction and evaluation via memoized statistics
    4.2.4 Selecting clusters to target with birth proposals
  4.3 Delete moves
    4.3.1 Delete proposal construction of local responsibilities
    4.3.2 Delete proposal construction of global responsibilities
    4.3.3 Scalable construction and evaluation with memoized statistics
    4.3.4 Selecting the target cluster to delete
  4.4 Experimental results
    4.4.1 Toy example where deletes outperform merges
    4.4.2 Toy image patch data
    4.4.3 MNIST digit clustering
    4.4.4 Clustering tiny images
  4.5 Discussion

5 Scalable variational inference for HDP topic models
  5.1 Hierarchical Dirichlet process (HDP) topic models
  5.2 Posterior inference as a variational optimization problem
    5.2.1 Mean-field approximate posterior
    5.2.2 Evidence lower-bound objective function
    5.2.3 Global term of the allocation objective, using a surrogate bound
    5.2.4 Document-specific term of the allocation objective
    5.2.5 Assignment entropy term of the allocation objective
  5.3 Update steps for variational optimization
    5.3.1 Global parameter update step for observation model
    5.3.2 Global parameter update step for allocation model
    5.3.3 Local update step
    5.3.4 Sparse restart proposals for local step
    5.3.5 Specialization to bag-of-words datasets
  5.4 Algorithms for HDP topic model posterior estimation
    5.4.1 Full-dataset variational
    5.4.2 Stochastic variational
    5.4.3 Memoized variational
  5.5 Variational algorithms with proposal moves that adapt the number of clusters
    5.5.1 Merge proposals
    5.5.2 Delete proposals
    5.5.3 Birth proposals
  5.6 Experimental results
    5.6.1 Toy bars dataset
    5.6.2 Academic and news articles
    5.6.3 Image patch modeling
  5.7 Discussion

6 Scalable variational inference for the HDP-HMM
  6.1 Hierarchical Dirichlet process hidden Markov models
    6.1.1 Generative model for each sequence
    6.1.2 Hierarchical prior on transition probabilities via the HDP
    6.1.3 Sticky self-transition bias
  6.2 Posterior inference as a variational optimization problem
    6.2.1 Mean-field approximate posterior
    6.2.2 Evidence lower-bound objective function
    6.2.3 Surrogate objective for sticky HDP-HMM
  6.3 Update steps for variational optimization
    6.3.1 Global parameter update step for observation model
    6.3.2 Global step for allocation model
    6.3.3 Local step to update assigned state sequence
  6.4 Variational algorithms with fixed number of topics
    6.4.1 Full-dataset algorithm
    6.4.2 Memoized variational algorithm
  6.5 Variational algorithms with proposal moves that adapt the number of clusters
    6.5.1 Merge proposals
    6.5.2 Birth proposals
    6.5.3 Delete proposals
  6.6 Experimental results
    6.6.1 Toy data
    6.6.2 Speaker diarization
    6.6.3 Motion capture dataset
    6.6.4 Chromatin epigenomic dataset
  6.7 Discussion

7 Sparse variational posteriors for cluster assignments
  7.1 Local step algorithms for L-sparse mixture models
    7.1.1 Local step with conventional dense responsibilities
    7.1.2 Local step with L-sparse responsibilities
    7.1.3 Related work
    7.1.4 Integration with scalable and adaptive proposal algorithms
  7.2 Experimental results with L-sparse mixture models
    7.2.1 Mixture models for image patches
  7.3 Local step algorithms for L-sparse topic models
    7.3.1 Mean field for the LDA topic model
    7.3.2 Local step of LDA training algorithm
  7.4 Experimental results with L-sparse topic models
    7.4.1 Topic modeling experiments
  7.5 Discussion

8 Recommendations
  8.1 Parallelization and other tricks for extreme scalability
  8.2 Approximation guarantees for variational optimization
    8.2.1 Distance-biased random initializations with guarantees
    8.2.2 Provable spectral algorithms
  8.3 Extensions for semi-supervised clustering
  8.4 Extensions with probabilistic programming
    8.4.1 A preliminary model specification language
    8.4.2 Towards general-purpose inference

List of Figures

1.1 Directed graphical representation of mixture, topic, and sequential models.
1.2 Application: discovering expressive models for natural image patches.
1.3 Application: discovering common topics from many Wikipedia articles.
1.4 Application: segmentation of motion capture sensor traces of human exercises.
1.5 Application: segmentation of human genome by regulatory patterns.
3.1 Directed graphical representation of the Dirichlet Process Mixture Model.
3.2 Illustration of possible learning rate schedules for stochastic variational inference.
3.3 Local optima found when clustering toy alphabet images with DP mixture models.
3.4 Local optima found when clustering toy bars data with DP mixture models.
3.5 Local optima found when clustering NIPS articles with DP mixture models.
4.1 Illustration of merge proposal for DP mixtures.
4.2 Illustration of birth proposal for DP mixtures.
4.3 Illustration of local responsibility construction under birth proposal.
4.4 Illustration of delete proposal for DP mixtures.
4.5 1D Gaussian toy example where merges fail but deletes succeed.
4.6 Comparison of scalable DP mixture algorithms on synthetic image patch dataset.
4.7 Comparison of scalable DP mixture algorithms on MNIST dataset.
4.8 Comparison of scalable DP mixture algorithms for clustering tiny images.
5.1 Directed graphical representation of hierarchical Dirichlet process topic model.
5.2 Plots of surrogate bound needed to handle non-conjugacy of HDP topic model.
5.3 HDP model selection: point estimation vs. variational with surrogate bound.
5.4 Illustration of restart proposals for HDP topic model local step.
5.5 Practical examples of merges and deletes on topic models.
5.6 Comparison of HDP topic model inference methods on toy bars dataset.
5.7 HDP topic model results on NIPS, Wikipedia, Science, and NYTimes datasets.
5.8 Comparison of DP mixtures and HDP admixtures on 3.5M image patches.
6.1 Directed graphical representation of the HDP hidden Markov model (HDP-HMM).
6.2 Toy data HDP-HMM algorithm comparison, using sticky and non-sticky model.
6.3 Comparison of HDP-HMM algorithms on 21 speaker diarization sequences.
6.4 Comparison of HDP-HMM algorithms on 6 motion capture sequences.
6.5 Comparison of HDP-HMM algorithms on 124 motion capture sequences.
6.6 Comparison of HDP-HMM algorithms for chromatin segmentation of human genome.
7.1 Speed and accuracy of L-sparse assignment posteriors for training mixture models.
7.2 L-sparse mixture model results on millions of image patches.
7.3 Speed and accuracy of L-sparse assignment posteriors for training topic models.
7.4 Comparison of values for sparsity-level L on topic models.
7.5 L-sparse topic model results on NIPS, Wikipedia, and NYTimes.
8.1 A probabilistic programming language for specifying clustering models.

List of Algorithms

2.1 Bregman k-means for point-estimation of finite mixture model.
2.2 Bregman k-means++ initialization for cluster mean point estimates.
2.3 Update for responsibilities given log posterior weights for mixture model.
2.4 Variational coordinate ascent for finite mixture model.
3.1 Variational coordinate ascent for DP mixture models.
3.2 Stochastic variational coordinate ascent for DP mixture models.
3.3 Memoized variational coordinate ascent for DP mixture models.
5.1 Algorithm for local step under an HDP topic model.
5.2 Algorithm for restart proposals used in local step of inference for HDP topic model.
5.3 Variational coordinate ascent for HDP topic model.
5.4 Stochastic variational coordinate ascent for HDP topic model.
5.5 Memoized variational coordinate ascent for HDP topic model.
6.1 Variational coordinate ascent for HDP-HMM.
6.2 Memoized variational algorithm for the HDP-HMM.
7.1 Update for dense responsibilities given log posterior weights for mixture model.
7.2 Update for L-sparse responsibilities given log posterior weights for mixture model.
7.3 Update for document-specific responsibilities under standard topic model.
7.4 Update for document-specific responsibilities under L-sparse topic model.

Reliable and scalable variational inference for nonparametric mixtures, topics, and sequences

Michael C. Hughes

May 2, 2016

Chapter 1

Introduction

A core task of unsupervised machine learning is the problem of discovering an interpretable set of discrete clusters from a complex dataset. For example, social scientists may wish to find thematic word clouds for concepts like “campaign finance” or “immigration policy” from news articles and then browse articles by concept. Biologists may want to discover groups of regulatory proteins that lead to similar gene transcription behavior. Roboticists may wish to identify common actions like “eat cereal” or “take medicine” from videos of daily human activities. By grouping high-dimensional raw data into clusters, we improve our ability to explore and summarize a dataset.

In this thesis, we consider clustering methods based on an underlying probabilistic model. Specifying such a model requires first identifying the relevant random variables. We will treat both the observed data and the data-specific cluster assignments as random variables.
We will also treat the parameters which define a cluster’s essential properties as random variables, including the appearance probability of each cluster and the parameters that control each cluster-specific data-generation process. Taking a probabilistic view of these quantities allows coherent reasoning under uncertainty.

The next step of modeling is expressing the relationships between the chosen random variables. We will use the formal language of directed graphical models (Lauritzen, 1996; Jordan, 2004) to encode our assumptions about independence and dependence among random variables. Within directed graphical models, we study hierarchical Bayesian models for clustering, especially those based on the exponential family. The foundations of this hierarchical modeling approach are well-established in textbooks such as Gelman et al. (2013) or Bishop (2006). This approach builds expressive models from a set of composable primitive distributions for binary, discrete, and continuous random variables. Modular composition of these primitives can produce complex joint distributions over our variables of interest while still ensuring that training a model from observed data is tractable.

The rest of this chapter introduces three well-known hierarchical Bayesian clustering models: the mixture model for exchangeable datasets, the topic model for grouped data, and the hidden Markov model for sequential data. After motivating each with a detailed intended application, we discuss the classic model selection problem of deciding how many clusters to use for a given dataset. We will see that there exists a Bayesian nonparametric version of each model which can solve this model selection problem elegantly. We will then summarize the core contributions of this thesis: devising efficient algorithms for training Bayesian nonparametric extensions of the mixture model, topic model, and hidden Markov model that simultaneously solve the model selection problem while scaling to large datasets.

Figure 1.1: Directed graphical representation of mixture, topic, and sequential models. Each model specifies relationships between four random variables: local observed data xn, local cluster assignments zn, global cluster frequencies π, and global cluster shape parameters {φ1, . . . , φK}.

1.1 Probabilistic clustering models

This section provides a high-level overview of the generative models discussed in this thesis, which are depicted as directed graphical models in Fig. 1.1. Detailed mathematical description will be provided in Ch. 2 and later.

For now, we will treat each model as having a total of K possible clusters, where the value K is assumed known a priori. With this assumption, we can label our clusters with integers k ∈ {1, 2, . . . , K}. Specifying K in practice leads to issues of model selection, which Bayesian nonparametric approaches address by taking an infinite limit K → ∞. We will cover this thoroughly later in Sec. 1.2.

By inspecting the graphical models in Fig. 1.1, we see they are composed entirely of four types of variables: x, z, π, and φ. We begin by defining each of these and explaining its role in our family of probabilistic clustering models. We then discuss each model shown in Fig. 1.1 more concretely, explaining the crucial relationships it assumes among these variables and how these matter in applications.

Observed data x. The entire dataset x consists of N total observations. If we assume all may be treated independently, we write x = {x1, x2, . . . , xN}.
If instead we have some group structure, such as D documents or D sequences each with Td observations, then we write x = {xd1, xd2, . . . , xdTd} for each group d ∈ {1, 2, . . . , D}, so that each overall observation with index n has a corresponding d, t index tuple. Generally, each observation xn may be binary, discrete, or continuous depending on the application. In the illustration of Fig. 1.1, we show an example where each observation xn is a vector with two real values, specifying a point in the 2-dimensional Cartesian plane.

Assignment labels z. Throughout all the models in this thesis, we assume each model explains the n-th observation xn by assigning it to exactly one cluster. The discrete random variable zn ∈ {1, 2, . . . , K} indicates the chosen cluster for observation xn.

Cluster frequencies π. Each model specifies the random variable πG, a length-K vector that is non-negative and sums to one. Each entry πGk gives the overall global probability that cluster k is assigned to some observation. We use the superscript G to emphasize that this is a global variable not attached to any specific data observation. Some models specify additional vectors in the set π. Each such frequency vector πj provides an appearance probability for each of the K clusters. Like πG, each vector πj must be non-negative and sum to one. Some models have global frequency vectors other than πG, such as the HMM in Fig. 1.1. Other models, such as the topic model, have local, document-specific frequency vectors πd.

Cluster shape parameters φ. The set of random variables φ contains a parameter vector for every cluster: φ = {φ1, . . . , φK}. Each vector φk defines the data-generating distribution of cluster k. For example, if the chosen likelihood model is Gaussian, φk would define the associated mean and variance parameters.
If the chosen likelihood model is Multinomial, φk would provide the probability of each possible discrete outcome. We will sometimes call these variables observation-model parameters, because they define how clusters produce observed data. Throughout all the models in Fig. 1.1, the φ variables are global variables.

Figure 1.2: Application: discovering expressive models for natural image patches. Full-covariance Gaussian mixture models for 8x8 pixel patches from natural images. Left (a): Example patches drawn from four clusters trained on millions of images. Each cluster may represent smooth surfaces, detailed texture patterns, or sharp edges. Center (b): Example whole images, along with select patches color-coded by hypothesized cluster. In the top image of an elk, yellow patches have strong horizontal edges while green patches represent textured grass. Right (c): Patch-level mixture models can be used within the expected patch log-likelihood (EPLL) framework of Zoran and Weiss (2012) to perform whole-image denoising, as well as deblurring or in-painting (not shown).

1.1.1 Mixture models

The simplest hierarchical Bayesian models we consider are mixture models (Everitt, 1981). In these models, each data observation xn, such as a text document, an image, or a gene, is assumed to be generated independently from one assigned cluster. Single-membership mixture models have powered many fruitful applications in domains as diverse as medical imaging, computational biology, and social science.

A particular motivating application for our work is modeling small patches (typically about 8x8 pixels in scale) from natural images. Zoran and Weiss (2011) have shown that using a simple mixture model with full-covariance Gaussian likelihoods, they can discover image patch clusters which are interpretable as smooth surfaces, strong edges, or complex textures, as shown in Fig. 1.2.
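To make the single-membership assumption concrete, the generative process for a toy one-dimensional Gaussian mixture can be sketched in a few lines of Python. This is an illustrative sketch, not code from this thesis; the cluster frequencies, means, and standard deviations below are invented for the example. Each assignment zn is drawn from Categorical(π), then xn is drawn from the Gaussian selected by zn.

```python
import random

def sample_mixture(N, pi, means, stds, seed=0):
    """Generate N observations from a toy 1D Gaussian mixture.

    Each x_n is produced by first drawing one assignment z_n ~ Categorical(pi),
    then drawing x_n from the single cluster z_n selects (mean, std from phi).
    """
    rng = random.Random(seed)
    z, x = [], []
    for _ in range(N):
        k = rng.choices(range(len(pi)), weights=pi)[0]  # z_n: exactly one cluster
        z.append(k)
        x.append(rng.gauss(means[k], stds[k]))          # x_n | z_n = k
    return z, x

# Hypothetical settings: two clusters with frequencies 0.7 and 0.3.
z, x = sample_mixture(1000, pi=[0.7, 0.3], means=[0.0, 5.0], stds=[1.0, 1.0])
```

With these frequencies, roughly 70% of the sampled points come from cluster 0; inference reverses this process, recovering π, φ, and the assignments z from the observed x alone.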
Furthermore, a trained Gaussian mixture model integrated with clever whole-image reasoning can outperform several baselines for image processing tasks such as image denoising or deblurring (Zoran and Weiss, 2011, 2012).

Figure 1.3: Application: discovering common topics from many Wikipedia articles. Illustration of insights possible from a topic model trained on many Wikipedia articles. To the model, each document is represented as an unordered list of observed words from a fixed vocabulary, sometimes called a “bag of words” representation. The model learns a set of global clusters or topics indexed by k, where the data-generating parameters φk are probability distributions over the predefined vocabulary. For human consumption, these are often represented by short lists of the top 5 most probable words. The goal of these models is to find a set of topics φ which explain the corpus well. For an individual document d, the model can infer the assigned labels zdt of individual words t as well as document-specific probabilities πd over the possible topics. Each document places probability over a sparse subset of possible topics.

1.1.2 Topic models

In several domains the singular membership assumption of mixture models may be too restrictive. For example, many news articles discuss several thematic topics: a few sentences about campaign finance laws may be followed by discussion of election prospects. Topic models (Blei, 2012) extend mixture models by allowing each group or document to have a specialized mixture of clusters or topics. Such extensions are often called mixed-membership clustering models (Airoldi et al., 2009). The most widely-known example is the Latent Dirichlet Allocation topic model (Blei et al., 2003). This model contains a document-specific random variable indicating the percentage of each document which is allocated to each of the possible global clusters or topics. Words within the document are then assumed to come from a mixture model with these document-specific cluster weights but common global clusters.

Fig. 1.3 introduces one exciting application of topic models: summarizing many Wikipedia articles. Topic models were popularized for text analysis tasks, but have also been successfully used in genetics (Pritchard et al., 2000; Mimno et al., 2015), action recognition (Wang et al., 2007), and many other domains. For the text domain, these models consume the bag-of-words representation of many articles. Our goal is to discover meaningful clusters or topics of related words from a finite, predefined vocabulary.

Each global topic is intuitively a cluster of semantically-related words. More formally, the parameter φk defining topic k specifies a sparse probability distribution over the fixed vocabulary, with “on-topic” words given high probability.
Within a single document, we estimate a mixed-membership probability vector πd which indicates how often each topic is used in the document. Overall, estimating this hidden topic structure can lead to improved browsing or exploration of a large corpus. The improved representations πd for each document could potentially improve downstream tasks like retrieval or recommendation.

1.1.3 Hidden Markov models

When modeling data with assumed sequential structure, we can replace the mixture model’s assumption of independence among observations given the global clusters with more structured relationships. For sequential data, hidden Markov models (Rabiner, 1989) specify pair-wise relationships between the cluster assignments of neighboring data items in a sequence. This can lead to improved spatial continuity of learned segmentations, or alternatively to better recovery of transition patterns.

The motion capture segmentation task illustrated in Fig. 1.4 provides one motivating application for the hidden Markov model. Our goal is to discover possible exercise motions such as running, walking or jumping from the motion capture traces gathered from a single human actor. Observed data consists of sensor measurements of the positions and angles of various joints on the human body. These measurements are captured once every 100ms. Given many sequences of these snapshots from several actors, we wish to segment all sequences jointly into continuous blocks of time labelled with distinct exercise motions from a common global set of learned cluster labels. The possible exercise motions are not provided to the algorithm in advance. Instead, they are learned from data.

Fig. 1.5 introduces another motivating application for sequential modeling: a computational biology task called chromatin state discovery and segmentation. Within a single chromosome, small functional segments of DNA are wound into bead-like structures called chromatin.
Gene expression is regulated by external proteins that bind to chromatin. The set of proteins which are present at a given site determines transcription. Given a large dataset measuring the presence or absence of different marker proteins across several DNA sequences, the goal is discovering a small set of states or clusters that represent interpretable patterns of spatially co-occurring regulatory proteins. The primary challenge is finding a compact set of states that is small enough to be interpretable, but complex enough to explain biological nuances. Computationally, training effectively from the entire human genome is also a paramount concern.

Figure 1.4: Application: segmentation of motion capture sensor traces of human exercises. Top: Skeleton visualizations of 12 possible exercise behavior types observed across all sequences (A: JumpJack, B: Jog, C: Squat, D: KneeRaise, E: ArmCircle, F: Twist, G: SideReach, H: Box, I: UpDown, J: ToeTouch1, K: SideBend, L: ToeTouch2). These 12 motions exhaustively cover the motions observed by a human annotator in 6 motion capture sequences. Middle: Segmentations z of all 6 observed sequences into the 12 possible exercises. Each sequence was labelled by hand by the same human annotator. Bottom: Sequence 2’s observed multivariate time series data. Motion capture sensors measure 12 joint angles every 0.1 seconds.

Figure 1.5: Application: segmentation of human genome by regulatory patterns. Top: Illustration of chromatin state discovery. Observed data is binary presence/absence of marker proteins that can attach to chromatin DNA to regulate gene expression. For several human cell types, we have one sequence of binary data for each of 23 chromosomes. The goal is to learn several interpretable “states” which define spatially co-occurring patterns of marker proteins. Lower right: Ernst and Kellis (2010) used an HMM model to discover 51 biologically plausible states. We show one gene’s segmentation under their estimated model for a subset of these states, colored by post-hoc hypothesized role in gene expression.

Other possible clustering models

Single-membership mixture models, mixed-membership topic models, and hidden Markov models for sequences form the three basic model templates we explore in this thesis. Many other useful hierarchical models for clustering are possible but beyond the scope of this thesis. The extensive related work includes models with spatial structure for image segmentation (Bouman and Shapiro, 1994; Sudderth and Jordan, 2009), models that apply clusters to relational structures within social networks (Wang and Wong, 1987; Kemp et al., 2006), models which use trees or grammars to establish relationships between data (Finkel et al., 2007; Liang et al., 2007), and models with notions of multiple-membership, where each document or group could use any combination of clusters or features without restriction (Griffiths and Ghahramani, 2007). Our hope is that our new approach to training these three basic model templates provides a foundation which could be extended to all related clustering models.

1.2 Bayesian nonparametric models for model selection

Across all clustering models defined above, the conventional model specification requires assuming a fixed number of clusters K. This number determines the flexibility or capacity of the model.
Larger values of K can capture finer structure within the observed data at the cost of greater storage and runtime requirements as well as more difficult interpretability of the resulting model. The task of identifying a good (or good enough) value for K is called model selection. The model selection task can often be computationally expensive and difficult in practice. For example, early work on chromatin state segmentation using an HMM required training a model with K ≈ 100 states and used ad hoc post-processing to identify a promising set of K ≈ 50 states (Ernst and Kellis, 2010). Other work assumed a fixed model with K = 25 states without rigorous justification beyond the need for a small set of interpretable clusters (Hoffman et al., 2012).

Bayesian nonparametric (BNP) clustering models (Orbanz and Teh, 2010) are a class of hierarchical directed graphical models which extend the conventional mixture, topic, and sequence models via an alternative prior over the global set of cluster random variables π, φ. In these models, the Dirichlet process (DP) (Ferguson, 1973) prior specifies a generative process which produces a countably infinite set of clusters indexed by k ∈ {1, 2, 3, . . .}. Each cluster has an associated appearance probability πk and shape parameter φk, as shown in Fig. 1.1. We formally define this generative process in Ch. 3. For now, it suffices to realize these models hypothesize an unbounded number of possible clusters, rather than an a priori finite set of K possible clusters. Previous work has introduced a Bayesian nonparametric extension of each clustering model we consider. Ferguson (1973) introduced the DP mixture model, which was later connected to an infinite limit of the finite mixture model by Neal (1992). Later, Teh et al. (2006) introduced the hierarchical Dirichlet process (HDP) topic model and the hierarchical Dirichlet process hidden Markov model (HDP-HMM), improving on an earlier “infinite HMM” of Beal et al. (2001).

BNP models promise powerful solutions for the model selection task because they inherit the interpretability and extensibility of all hierarchical Bayesian graphical models while avoiding the need to specify the number of clusters in advance. Instead, given a fixed observed dataset, the posterior distribution over the assigned clusters across all N observations gives some probability to all possible groupings or segmentations. One possible event is that a single large cluster explains every observation. Another possible event is that each observation comes from its own unique cluster. The total number of possible partitions of a dataset of size N, known as the Bell number, grows exponentially with N. There are 2 possible segmentations of N = 2 data items, 15 with N = 4, 4140 with N = 8, and 10480142147 with N = 16. The combinatorics of this space make it impossible to enumerate all possible segmentations. However, BNP models allow coherent reasoning about the posterior probability of events in this combinatorial space. Thus, the number of active clusters can be learned from data using these models and need not be specified in advance.

Another advantage of the BNP approach is its natural extension to large-scale or streaming applications. Given a trained model for N observations, the DP provides a coherent mechanism to reuse these existing clusters to explain an additional set of N′ observations, possibly using additional clusters for this data as well. The DP prior induces a rich-get-richer property where the expected number of active clusters grows logarithmically with the total dataset size. In this large-scale regime, the automatic model selection provided by BNP models is vastly preferred to expensive cross-validation for the number of clusters K.
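The partition counts quoted above follow the Bell numbers, which obey the recurrence B(n+1) = sum over k of C(n, k) B(k). This short sketch (not part of the thesis itself) reproduces the figures stated in the text:

```python
from math import comb

def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)] via the recurrence B(n+1) = sum_k C(n,k) B(k)."""
    B = [1]  # B(0) = 1: the empty dataset has exactly one (empty) partition
    for n in range(n_max):
        B.append(sum(comb(n, k) * B[k] for k in range(n + 1)))
    return B

B = bell_numbers(16)
print(B[2], B[4], B[8], B[16])
# → 2 15 4140 10480142147
```

Even at N = 16 the count exceeds ten billion, which is why explicit enumeration is hopeless and posterior reasoning over this combinatorial space is needed instead.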
In theory, BNP models could be trained once on a large dataset and deliver a useful clustering which is neither too small (missing crucial structure present in the data) nor too large (containing redundant or irrelevant clusters). However, achieving this automatic model selection in practice is difficult due to limitations of existing training algorithms.

1.3 Existing algorithms: challenges and opportunities

Given a fixed dataset, training any clustering model, whether finite or Bayesian nonparametric, requires deciding the goal of inference. During training, each global variable and local variable can be treated in two possible ways: either we estimate a specific value of the random variable, producing a point estimate, or we estimate an approximate posterior distribution for the random variable. For example, a point estimate of the mixture model assignment variable zn would yield a specific value in the set {1, 2, . . . , K}, where K is the number of clusters in a finite model. In contrast, an approximate posterior distribution q(zn) for variable zn could be represented in two ways. First, as a vector of probabilities rn1, . . . , rnK, where rnk gives the probability that cluster k explains data observation n. Second, we could represent the posterior via a finite set of samples z(1), z(2), . . . , z(S) from this distribution. These samples could be used to easily compute expectations or moments of any functions of the random variable zn via standard Monte Carlo methods (Gelman et al., 2013).

In this section, we review two prominent approaches in the literature for training clustering models. First, optimization approaches can deliver either point estimates or approximate posterior distributions. These methods set up an optimization problem based on either the joint likelihood log p(x, z, φ, π) or the marginal likelihood log p(x). Defining and solving such problems tractably requires calculus of variations, leading to the name variational optimization methods.
Second, sampling algorithms train models by developing Markov chain Monte Carlo (MCMC) algorithms that deliver samples representing the posterior. From the outset, we emphasize that our contributions will focus on variational methods that fully instantiate approximate posterior representations for all random variables x, z, π, and φ. It is also sometimes possible to integrate away some random variables and obtain an exact representation of the joint density of the remaining variables. Integrating away random variables is sometimes called collapsing or marginalization. See Ch. 8 of Bishop (2006) for a basic introduction to marginalization. When tractable, this can sometimes lead to improved posterior estimates of the remaining variables due to the Rao-Blackwell theorem (Murphy, 2012, Thm. 24.2.1), as advocated by Casella and Robert (1996). Many collapsed MCMC samplers (Griffiths and Steyvers, 2004) and collapsed optimization methods (Teh et al., 2008) for our family of clustering models have been successfully developed. However, these are often difficult to parallelize due to the dependencies induced by marginalization. Because our focus is scalability and simplicity, we will work without marginalization.

1.3.1 Variational optimization algorithms

A wide variety of possible training algorithms exist for the mixture model and its extensions. For finite models, point estimation using a maximum likelihood optimization problem with Gaussian-distributed observations leads in the limit to the well-known k-means algorithm (Lloyd, 1982), which is surveyed extensively by Jain (2010). This approach estimates a single “hard assignment” zn ∈ {1, 2, . . . , K} for each observation n. Kulis and Jordan (2012) developed a variant of the hard-assignment k-means algorithm for the Dirichlet process prior.
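The two treatments of the assignment variable zn described in this section differ only in their final step, given per-cluster log posterior weights. As a small illustration (the weights below are made up, and this is a sketch rather than this thesis’s own implementation), a hard assignment takes an argmax while the soft variational responsibility update is a numerically stabilized softmax:

```python
import math

def hard_assignment(log_weights):
    """Point estimate: pick the single cluster with the largest log posterior weight."""
    return max(range(len(log_weights)), key=lambda k: log_weights[k])

def soft_responsibilities(log_weights):
    """Approximate posterior q(z_n): softmax of log weights, shifted by the max
    before exponentiating so large-magnitude weights do not overflow."""
    m = max(log_weights)
    expw = [math.exp(w - m) for w in log_weights]
    total = sum(expw)
    return [e / total for e in expw]

log_w = [-1.2, -0.3, -4.0]      # hypothetical log posterior weights for K=3 clusters
print(hard_assignment(log_w))    # → 1 (the largest weight)
r = soft_responsibilities(log_w)
print(round(sum(r), 6))          # → 1.0 (responsibilities form a probability vector)
```

The soft vector r retains uncertainty (cluster 0 keeps substantial mass here), which is exactly the information a zero-variance point estimate throws away.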
In contrast to point estimation of all variables, the expectation-maximization (EM) algorithm (Dempster et al., 1977) for maximum likelihood training estimates an approximate posterior distribution q(zn) while still producing point estimates for the global variables π, φ. These zero-variance point estimation procedures can be applied to Bayesian nonparametric models but will not be useful for model selection, for reasons discussed in Ch. 2. It is also possible to treat global random variables π, φ with approximate posterior distributions while using hard point estimates for the local z. This approach is known as maximization-expectation (ME) (Kurihara and Welling, 2009).

Treating all variables π, φ, z with approximate posteriors leads to a variational EM algorithm, where the optimization problem maximizes an objective that marginalizes rather than conditions on these variables. This approach leads to coherent algorithms for Bayesian nonparametric models because the objective function can be used for model selection, as we discuss in Ch. 2. Marginal likelihood has long been the training objective of choice for Bayesian model selection (Jefferys and Berger, 1992; Rasmussen and Ghahramani, 2001). As examples of previous work, Blei and Jordan (2006) introduced variational methods for DP mixtures, with later extensions by Kurihara et al. (2007). Variational optimization algorithms for finite hidden Markov models were first developed by Beal (2003) and MacKay (1997). Recently, Johnson and Willsky (2014) developed more scalable variational training for the HDP-HMM. Variational methods for the finite topic model were first published by Blei et al. (2003), with several extensions to the HDP topic model and other mixed-membership models (Teh et al., 2008; Liang et al., 2007; Wang et al., 2011; Bryant and Sudderth, 2012).

All the approaches described above frame inference as an optimization problem.
The objective is to maximize either a conditional likelihood or a marginal likelihood. This objective takes free parameters for each random variable π, φ, z, which define either the point estimate of the variable or its approximate posterior. For mixture models and their extensions, the optimization problem can be solved via block coordinate ascent algorithms, which take as input some initial values of the free parameters and then iteratively update subsets of free parameters while holding others fixed. For all the approaches discussed here, even the simple k-means algorithm, the underlying optimization objective function is non-convex and has many local optima. Depending on the initialization, the final fixed point reached by the optimization may have wildly varying solution quality, both in terms of the objective function and in terms of human judgement.

This sensitivity to initialization is especially problematic for variational approaches to Bayesian nonparametric models. Although an infinite number of clusters exists a priori, conventional optimization methods like Blei and Jordan (2006) require instantiating a finite number of clusters at initialization. This finite set is maintained throughout training, because the algorithm contains no steps that add or remove clusters. It may be that some estimated cluster probabilities πk fall to zero, but these clusters will still be explicitly represented in memory. The true infinite posterior is thus truncated to a finite approximate posterior. While some theoretical arguments suggest a large-enough truncation may be satisfactory (Ishwaran and Zarepour, 2002), in practice setting this truncation level too large can lead to slow progress. Selecting an appropriate truncation level thus reintroduces the very model selection problem that BNP models promised to avoid. Without a very clever initialization, fixed-truncation variational methods for BNP models often reach very poor local optima.
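Even the k-means special case makes this initialization sensitivity concrete. In the deliberately constructed toy example below (the dataset and starting centers are chosen for illustration and are not from this thesis), Lloyd’s algorithm converges to a left/right split with cost 1.0 from one initialization, but from another it is trapped at a top/bottom split, a genuine fixed point with cost 16.0:

```python
def lloyd_kmeans(points, centers, n_iters=20):
    """Plain Lloyd's algorithm with K = len(centers); returns (centers, cost)."""
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center (squared distance).
        assign = [min(range(len(centers)),
                      key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centers[k])))
                  for pt in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for k in range(len(centers)):
            mine = [pt for pt, a in zip(points, assign) if a == k]
            if mine:
                new_centers.append(tuple(sum(c) / len(mine) for c in zip(*mine)))
            else:
                new_centers.append(centers[k])  # keep an empty cluster in place
        centers = new_centers
    cost = sum(min(sum((p - c) ** 2 for p, c in zip(pt, ck)) for ck in centers)
               for pt in points)
    return centers, cost

# Two tight pairs of points, at x=0 and x=4.
data = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
_, good = lloyd_kmeans(data, [(0.0, 0.5), (4.0, 0.5)])  # left/right split: cost 1.0
_, bad = lloyd_kmeans(data, [(2.0, 0.0), (2.0, 1.0)])   # top/bottom split: cost 16.0
print(good, bad)  # same data, same algorithm, wildly different objective values
```

Both final configurations are fixed points of the coordinate updates, so no amount of further iteration escapes the poor one; only a different initialization, or a proposal move of the kind developed in this thesis, can.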
The classic advice of performing many random restarts and selecting the best requires too much computation to be practical for large datasets.

1.3.2 MCMC sampling algorithms

Markov chain Monte Carlo (MCMC) training algorithms (Andrieu et al., 2003) are widely used for performing approximate posterior inference via sampling. These methods begin by instantiating point-estimate samples of all variables, and then apply carefully constructed Markov transition operators to evolve new samples which gradually approach the true posterior. After many iterations of applying a correct transition operator, the produced samples are guaranteed to asymptotically converge to the true posterior. Samplers have been developed for both finite and infinite versions of the mixture model (Neal, 1992; Escobar and West, 1995; Rasmussen, 1999; Walker, 2007; Jain and Neal, 2004), the topic model (Griffiths and Steyvers, 2004; Yao et al., 2009; Teh et al., 2006), and the hidden Markov model (Scott, 2002; Van Gael et al., 2008; Fox et al., 2011).

In practice, samplers can be slow to make large changes from their initialization. Multiple independent runs using different initializations but applying the same transition operators may thus reach very different regions of the sample space after generous but finite computation budgets. Samplers for hierarchical models are often slow to mix due to the limited range of each conditional transition operator. When holding a large set of variables fixed and updating only a much smaller subset, the number of random samples required to move the whole set of variables to better values may be enormous.

Several improvements have been suggested for mixture models by including transition operators that propose large, joint changes to the whole state space {π, φ, z}. For example, split and merge proposals (Jain and Neal, 2004) hypothesize splitting an existing data cluster into two smaller subsets or combining two existing clusters into a new larger cluster.
These proposals can provide useful improvement over the non-proposal baseline. Recent extensions for DP mixture models, HDP topic models, and Bayesian nonparametric Markov models all report useful gains. Yet these algorithms do not scale well to larger datasets. Furthermore, designing transition operators that maintain the required detailed-balance constraints becomes increasingly difficult for more complicated models, as seen in the complexity of more recent work in this area (Fox et al., 2014; Chang and Fisher III, 2014).

1.4 Outline of Contributions

This thesis develops new and improved variational optimization algorithms for Bayesian nonparametric models. We design a unified inference framework for a broad class of models including mixtures, topics, and sequences that is both scalable to millions of examples and reliable at escaping poor initializations to recover the ideal set of clusters or states. While not the first effort on any of these fronts, our contribution lies in our unified approach, which pursues elegant optimization problems based on marginal likelihood training objectives and is more reliable than competitors at finding useful, interpretable clusters from uninformed initializations. The outline below specifies our detailed contributions to the field of unsupervised machine learning, organized by chapter.

Ch. 2: Background. We first establish the background knowledge for our approach in Ch. 2. This includes careful description of our modeling assumptions as well as some basic optimization algorithms for finite mixture models.

Ch. 3: Scalable inference for DP mixtures. Next, in Ch. 3 we develop a fixed-truncation variational inference algorithm for Dirichlet process mixture models. Building on original work from a NIPS 2013 conference paper (Hughes and Sudderth, 2013), we define a novel optimization problem for mean-field variational inference which makes fewer restrictive assumptions than alternatives.
We then develop a block-coordinate ascent optimization algorithm which can be extended to both stochastic and incremental (memoized) settings for processing data at scale. Our memoized algorithm is shown to have the same runtime cost as stochastic methods but to require no nuisance learning rate parameters. We show promising results on several benchmark datasets.

Ch. 4: Reliable inference for DP mixture models via proposal moves. Next, in Ch. 4 we introduce novel proposal moves for variational inference in DP mixture models that can adapt the number of clusters actively represented by the algorithm during training. We show that birth moves can add useful new clusters while merge and delete moves can remove existing clusters which are redundant or irrelevant. We show that these proposal moves are guaranteed to improve our DP mixture optimization objective, and that they can be deployed on millions of examples via our memoized training algorithm. These proposals were first introduced in Hughes and Sudderth (2013), though significant improvements have been made since.

Ch. 5: Reliable and scalable inference for HDP topic models. In Ch. 5 we develop a new variational optimization problem for the HDP topic model. Due to the challenging non-conjugacy of the HDP generative model, several past authors have used zero-variance point estimates to handle the top-level random variables π which define cluster frequencies. Building on an AISTATS 2015 conference paper (Hughes et al., 2015a), we show that these point estimates have terrible model selection properties and suggest instead a tractable surrogate optimization objective that allows proper Bayesian model selection. Our final optimization objective can be optimized via block coordinate ascent algorithms or with stochastic or memoized scalable algorithms. We show that merge and delete proposal moves can produce dramatic improvements in solution quality.

Ch. 6: Reliable and scalable inference for HDP hidden Markov models. In Ch.
6 we extend our earlier results to the HDP hidden Markov model, developing improved optimization objective functions for both the fixed-truncation and adaptive proposal cases. We show results from several applications, including the motion capture segmentation from Fig. 1.4 and the chromatin state segmentation task of Fig. 1.5. These results extend those published in a NIPS 2015 conference paper (Hughes et al., 2015b).

Ch. 7: Sparse variational posteriors for local assignments. Our final contribution in Ch. 7 is a new framing of the variational optimization problem that allows scaling to large numbers of clusters K as well as large datasets. Specifically, we pursue a new approximate posterior for the discrete cluster assignment variables with sparsity constraints. Standard approaches require estimating a probability for each and every represented cluster. However, at any single observation, only a few of the K total clusters will have significant explanatory power, while the rest will have near-zero mass. We suggest an additional constraint: that at most L clusters in this posterior are allowed non-zero mass. We show that this leads to algorithms for both mixture and topic models where moderate L values are faster to train (2-3x) than the dense baseline L = K but yield comparable heldout predictions.

Ch. 8: Recommendations. We conclude with discussion in Ch. 8 of promising directions for future work, especially connections to semi-supervised learning and probabilistic programming. Our hope is that this thesis inspires optimization algorithms for a broad family of probabilistic clustering models that can simultaneously scale to millions of observed data points and reliably avoid local optima.

1.4.1 Open-source software

As a final contribution of this thesis, all the algorithms presented for mixture models, topic models, and sequential models are available in a public software package called BNPy: Bayesian nonparametrics for Python1.
We hope this effort will aid reproducibility of our results and allow practitioners to use our final algorithms easily on their own data.

1 https://bitbucket.org/michaelchughes/bnpy-dev/. Accessed May 2016.

Chapter 2

Background

In this chapter, we develop the foundational concepts needed to describe our chosen class of probabilistic clustering models and train these models from observed data. First, we define the probabilistic generative model for all random variables of interest, highlighting a useful two-stage modular decomposition that separates each model into an allocation model p(z, π) and an observation model p(x, φ|z). Next, we review key properties of the exponential family of distributions, showing how the sufficient statistics, conjugacy, and information geometry in this family of distributions lead to tractable posterior point estimation or posterior density estimation. Finally, we compare and contrast two possible approaches for optimization-based training of a finite mixture model: posterior point estimation via a k-means-like algorithm and approximate posterior density estimation via variational methods. This background knowledge for the finite mixture model will inform how we approach our novel variational methods for Bayesian nonparametric models in later chapters.

2.1 Hierarchical directed graphical models for clustering

In Fig. 1.1, we illustrate mixture models, topic models, and hidden Markov models as graphical models over related sets of random variables: data x, assignments z, cluster frequency vectors π, and cluster shape parameters φ. We now establish the fundamental probabilistic relationships between these random variables. In this chapter, we will assume a finite known set of K clusters. In later chapters, we will show how to easily extend these models to the Bayesian nonparametric case of infinitely many clusters. Across all the graphical models in Fig.
1.1, we observe that each can be decomposed into two smaller models which generate subsets of the random variables π, φ, z, x. First, an allocation model allocates or assigns cluster labels to every observation in a dataset. This process generates both the frequency vectors π and the cluster labels z. Second, the observation model generates the shape parameters φ, and then produces the data x given fixed cluster labels z. Mathematically, this means that our chosen family of clustering models has a joint probability density that factorizes as:

$$p(x, z, \pi, \phi) = \underbrace{p(\pi)\,p(z|\pi)}_{\text{allocation model}} \cdot \underbrace{p(\phi)\,p(x|z,\phi)}_{\text{observation model}} \tag{2.1}$$

This factorization is true for both the finite-capacity models of this chapter as well as the Bayesian nonparametric models of later chapters. For a review of how graphical models imply factorizations in joint density functions, see Jordan (2004) or Bishop (2006). The differences between the mixture model, topic model, and hidden Markov model lie entirely in the form of the allocation model. The structure in Fig. 1.1 shows that all three models share the same overall factorization assumptions for p(φ)p(x|z, φ), and thus the same observation model. The conditional independence implied by our decomposition into allocation and observation models leads to several useful properties. For example, when the assignments z are known or observed, the posterior for π is independent of the posterior for φ. This conditional independence leads to simplified update steps for all our inference algorithms. The separation of models into allocation and observation pieces promotes code reuse and easy composability, because individual allocation models can be used interchangeably with different observation models. Below, we present the formal assumptions behind both allocation models and observation models, in the context of the models from Fig. 1.1. First, we emphasize a crucial distinction between global and local random variables.
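As a concrete illustration of this modular split, the minimal Python sketch below evaluates the two factors of Eq. (2.1) separately for a toy K = 2 univariate Gaussian mixture. The function names and parameter values are ours for illustration only, and the priors p(π) and p(φ) are omitted by treating π and φ as fixed:

```python
import math

def log_categorical(z, pi):
    # log p(z | pi) for a single assignment z in {0, ..., K-1}
    return math.log(pi[z])

def log_gauss(x, mean, var):
    # log density of a univariate Gaussian observation
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_allocation(pi, z_list):
    # allocation factor of Eq. (2.1): sum_n log p(z_n | pi)
    # (the prior p(pi) is omitted: pi is treated as fixed in this sketch)
    return sum(log_categorical(z, pi) for z in z_list)

def log_observation(x_list, z_list, means, var=1.0):
    # observation factor of Eq. (2.1): sum_n log p(x_n | phi_{z_n})
    # (the prior p(phi) is likewise omitted)
    return sum(log_gauss(x, means[z], var) for x, z in zip(x_list, z_list))

pi = [0.7, 0.3]            # fixed cluster frequencies
means = [-2.0, 2.0]        # fixed cluster shape parameters phi
x = [-1.9, 2.2, -2.1]      # three observed data atoms
z = [0, 1, 0]              # their cluster assignments
log_joint = log_allocation(pi, z) + log_observation(x, z, means)
```

Because the two factors share no parameters once z is fixed, each can be updated or swapped out independently, which is exactly the composability argued for above.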
2.1.1 Global vs. local random variables

Each random variable in our joint set x, z, π, φ can be identified as either global or local. This terminology was introduced by Hoffman et al. (2013). We define the set of local variables as those specific to a particular subset of the observed data. Global variables are by definition not local or specific to any particular data.

Examples of local variables. In general, any random variable indexed by n (indicating a specific data atom) or by d (indicating a specific document or sequence) is a local random variable. For example, each x_n is a local random variable, as are the corresponding cluster label variables z_n ∈ {1, 2, . . . , K}. The document-specific frequencies π_d in the topic model are also local variables.

Examples of global variables. Each model has global cluster shape parameters φ = {φ_k}_{k=1}^K. Additionally, all models have a global frequency vector π^G, which indicates the appearance probability of each cluster k via the value of π^G_k. Both π^G and φ define the complete set of global parameters for the mixture and topic models. The HMM in Fig. 1.1 has some additional frequency vectors {π_j}_{j=0}^K which are global random variables. These are treated thoroughly in Ch. 6.

The distinction between global and local variables matters for several reasons. Essentially, a model is fully defined by concrete instantiations of each of its global variables. This set of concrete global variables is all that we need to represent a trained model in memory or on disk. Given this set, we can apply the model to new, never-before-seen data x′, or even generate sample data by running the model’s generative process forward.

2.1.2 Allocation model: Generating cluster assignments

An allocation model defines a generative model for two sets of random variables: a collection of frequency vectors π and a set of assignment variables z. There is always one assignment variable z_n ∈ {1, 2, . . . , K} for each of the N data atoms.
However, the size of the set π varies greatly depending on the chosen allocation model. Within the set π, every model includes a top-level frequency vector π^G, which we call the global frequency vector. For mixture models, this is the only member of π. More complicated models include additional frequency vectors. For example, the topic model has one for every document: {π_d}_{d=1}^D. The HMM has one frequency vector for every cluster: {π_k}_{k=1}^K. The inclusion of additional frequency vectors in π gives these more complex models increased expressiveness and flexibility. Following the factorization in Eq. (2.1), the generative process first generates π, and then conditionally generates z given π. Below, we first define the complete generative process for the mixture model. We later sketch the process for topic models and HMMs, deferring some details to later chapters dedicated to these more complex allocation models.

Allocation process for the mixture model

For the finite mixture model, π^G is a non-negative vector of length K that sums to one. We denote the space of such frequency vectors as π^G ∈ ∆^K, where ∆^K refers to the K-dimensional simplex. The natural probability distribution to produce the random variable π^G is the Dirichlet distribution (Bishop, 2006, Appendix B). This generative model is formalized as:

$$\pi^G \sim \text{Dir}_K\Big(\frac{\gamma}{K}, \ldots, \frac{\gamma}{K}\Big) \tag{2.2}$$

The log density log p(π^G) of the vector π^G is

$$\log \text{Dir}(\pi^G | \gamma) = \log \Gamma(\gamma) - \sum_{k=1}^K \log \Gamma\Big(\frac{\gamma}{K}\Big) + \sum_{k=1}^K \Big(\frac{\gamma}{K} - 1\Big) \log \pi^G_k \tag{2.3}$$

Under this Dirichlet distribution, the expected value of this vector is the uniform distribution over K choices:

$$\mathbb{E}[\pi^G] = \Big[\frac{1}{K} \ \ldots \ \frac{1}{K}\Big] \tag{2.4}$$

The scalar concentration parameter γ > 0 determines the variance around this mean. Large values γ ≫ K correspond to very low variance, while small values γ ≪ K encourage very sparse outcomes for π^G with many entries close to zero.
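The symmetric Dirichlet draw in Eq. (2.2) and its uniform mean in Eq. (2.4) can be simulated via the standard construction that normalizes independent Gamma draws. The sketch below uses only the Python standard library; the helper name is ours:

```python
import random

def sample_symmetric_dirichlet(K, gamma, rng):
    # Draw pi^G ~ Dir_K(gamma/K, ..., gamma/K) by normalizing
    # K independent Gamma(gamma/K, 1) random variables.
    g = [rng.gammavariate(gamma / K, 1.0) for _ in range(K)]
    total = sum(g)
    return [gk / total for gk in g]

rng = random.Random(0)
K, gamma = 5, 10.0
samples = [sample_symmetric_dirichlet(K, gamma, rng) for _ in range(20000)]
# empirical mean of each entry should approach E[pi^G_k] = 1/K
mean = [sum(s[k] for s in samples) / len(samples) for k in range(K)]
```

Repeating this with γ ≪ K instead produces draws where most of the mass concentrates on a few entries, matching the sparsity behavior described above.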
Given the global frequency vector π^G, the mixture model assumes that each data atom indexed by n has an assigned label z_n ∈ {1, 2, . . . , K} drawn independently from the same categorical distribution:

$$z_n \sim \text{Cat}_K(\pi^G_1, \ldots, \pi^G_K) \tag{2.5}$$

Under this categorical distribution, we have p(z_n = k) ≜ π^G_k. We can write the log density as

$$\log p(z_n | \pi^G) = \sum_{k=1}^K \delta_k(z_n) \log \pi^G_k, \qquad \delta_k(z_n) = \begin{cases} 1 & \text{if } z_n = k \\ 0 & \text{if } z_n \neq k \end{cases} \tag{2.6}$$

where δ_k(z_n) is an indicator function which is one if z_n = k and zero otherwise.

Allocation models for topics, sequences, and beyond.

Finite topic model. The topic model in Fig. 1.1 extends the mixture model by allowing each document d its own custom frequency vector π_d; these are tied hierarchically via the global vector π^G. Generatively, we still draw π^G as in Eq. (2.2). Then, each document’s frequencies are drawn independently according to:

$$\pi_d | \pi^G \sim \text{Dir}_K(\alpha \pi^G_1, \ldots, \alpha \pi^G_K) \tag{2.7}$$

Then, each assignment in document d is drawn independently:

$$z_{dt} \sim \text{Cat}_K(\pi_{d1}, \ldots, \pi_{dK}) \tag{2.8}$$

Finite hidden Markov model. The HMM in Fig. 1.1 extends the mixture model by having a custom transition vector π_k for each cluster k, and by introducing chain-graph dependencies among the assignments z_d made within each sequence d. Generatively, we still draw π^G as in Eq. (2.2). Then, for each cluster k ∈ {1, 2, . . . , K}, as well as a special cluster indexed by 0 for the start of each sequence, we have:

$$\pi_j | \pi^G \sim \text{Dir}_K(\alpha \pi^G_1, \ldots, \alpha \pi^G_K) \tag{2.9}$$

Then, the assignments z evolve according to a standard first-order Markov process, with the assigned label at timestep t > 1 dependent on the previous timestep t − 1:

$$z_{d1} \sim \text{Cat}_K(\pi_{01}, \ldots, \pi_{0K}), \qquad z_{dt} | z_{d,t-1} \sim \text{Cat}_K(\pi_{z_{d,t-1},1}, \ldots, \pi_{z_{d,t-1},K}) \tag{2.10}$$

We assume that the first timestep in a sequence is always drawn from the same starting distribution with frequencies π_0.

Towards a general specification of allocation models.
We can imagine a much broader family of allocation models which includes mixture models, topic models, sequence models, and many more. Generally, the collection π can be seen as many frequency vectors {π_j}_{j=1}^J, whose graphical model is a tree-like dependency structure, with each child generated by a Dirichlet distribution with mean equal to its parent and user-supplied variance. Similarly, the collection of assignment random variables z is generated by a graphical model where each group (document, sequence, etc.) has its own tree. Within a given tree, each individual node z_n is generated from a categorical distribution using a specific node π_j from π. The chosen node in π is specified by the parent of z_n in the z graph. Further discussion of this idea is given in Ch. 8.

2.1.3 Observation model: Generating data from assigned clusters

The observation models we consider assume that the cluster assignments z for every data atom are known, and must define a generative model for the pair of random variables φ, x. This pair can be decomposed into K individual cluster shape parameters {φ_k}_{k=1}^K as well as the N observed data atoms {x_n}_{n=1}^N. We then assume the following factorized generative model:

$$\log p(x, \phi | z) = \sum_{k=1}^K \log p(\phi_k) + \sum_{k=1}^K \sum_{n=1}^N \delta_k(z_n) \log p(x_n | \phi_k) \tag{2.11}$$

We will refer to the two probability density functions in the equation above as the likelihood L and the prior P, where

$$\mathcal{L}(x_n | \phi_k) \triangleq p(x_n | \phi_k) \tag{2.12}$$

$$\mathcal{P}(\phi_k) \triangleq p(\phi_k) \tag{2.13}$$

We will assume each of these distributions belongs to the exponential family, and further that the prior P is conjugate to the likelihood L. This assumption allows us to perform efficient training under many different observed data scenarios, including binary data, discrete count data, and real-valued data. The next section formalizes this exponential family assumption and its consequences.
2.2 Distributions from the exponential family

By definition, a density belongs to the exponential family if we can write its log density in the form:

$$\log p(x_n | \phi_k) = \phi_k^T s(x_n) - c(\phi_k) + h(x_n) \tag{2.14}$$

The vector s(x_n) ∈ R^A is a sufficient statistic vector. It contains all functions of the data atom x_n ∈ X necessary for computing the chosen probability density. For example, if the data atom x_n is a binary variable (X = {0, 1}) and the likelihood density L is Bernoulli, then the sufficient statistic is simply the one-dimensional vector s(x_n) = [x_n]. If each data atom x_n is a scalar real value and the likelihood density L is a Gaussian with unknown mean and unknown variance, then the vector s(x_n) will be two-dimensional: s(x_n) = [x_n, x_n^2]. Using these two scalars, both the sample mean and sample variance can be computed.

The vector φ_k ∈ Φ ⊂ R^A is a natural parameter vector. Each possible instantiation of φ_k ∈ Φ defines a single valid density function over the random variable x_n.

The reference measure function h(x_n) contains all terms in the log density that are constant with respect to the parameter φ_k. It rarely plays a crucial role in learning tasks, and is needed solely to make sure that the density L(x_n | φ_k) integrates to one over the domain of x_n ∈ X. It is frequently either zero (e.g. h(x_n) = 0) or some other constant (e.g. h(x_n) = log √(2π)).

The function c(φ_k) is known as a cumulant function. This function plays a crucial role, much more so than h(x_n). The function c(φ_k) maps every natural parameter to a real scalar: Φ → R. Specifically, the cumulant function is defined as an integral over all possible data items x ∈ X. The natural parameter space Φ is the set of all φ_k vectors for which this integral converges to a finite value:

$$c(\phi_k) \triangleq \log \int_{x \in \mathcal{X}} \exp\big[\phi_k^T s(x) + h(x)\big] \, dx, \qquad \Phi = \{\phi_k \in \mathbb{R}^A : c(\phi_k) < +\infty\} \tag{2.15}$$

Only when φ_k ∈ Φ does the function p(x_n | φ_k) define a valid probability density function that integrates to one over its domain.
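The normalizing role of the cumulant is easy to verify in the Bernoulli case, which fits Eq. (2.14) with s(x_n) = [x_n], h(x_n) = 0, and c(φ) = log(1 + e^φ). The short check below (our own sketch) confirms that subtracting c(φ) makes the density sum to one over X = {0, 1}:

```python
import math

def bern_log_density(x, phi):
    # Bernoulli in natural form: log p(x|phi) = phi * x - c(phi),
    # with cumulant c(phi) = log(1 + e^phi) and h(x) = 0.
    return phi * x - math.log(1.0 + math.exp(phi))

phi = 0.8
# summing exp(log density) over the whole domain X = {0, 1}
total = sum(math.exp(bern_log_density(x, phi)) for x in (0, 1))
```

Here any real φ lies in Φ, because the sum defining c(φ) is finite for every φ; for other families (e.g. the zero-mean Gaussian below) the natural parameter space is a strict subset of the reals.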
2.2.1 Examples of exponential family likelihoods

We now review several examples of exponential family likelihood distributions, contrasting the common way of writing each distribution with the natural parameterization which fits the pattern of Eq. (2.14).

Example: Categorical likelihood

Consider a random variable x_n representing a single word from a finite discrete vocabulary of V possible words. Formally, the value of x_n ∈ {1, 2, . . . , V} identifies a particular index of this vocabulary. The exponential family density which explains this data is called the categorical distribution.

Categorical: Common form. The common parameterization of the categorical likelihood for discrete data x_n is:

$$\log p(x_n | \mu_k) = \sum_{v=1}^V \delta_v(x_n) \log \mu_{kv} \tag{2.16}$$

where the parameter vector μ_k ∈ ∆^V is a non-negative vector of size V that sums to one. The entry μ_kv defines how probable symbol v is.

Categorical: Natural exponential family form. The sufficient statistic vector s(x_n) and natural parameter are both (V − 1)-dimensional vectors:

$$s(x_n) = [\delta_1(x_n)\ \delta_2(x_n)\ \ldots\ \delta_{V-1}(x_n)], \qquad \phi_k = [\phi_{k1}\ \phi_{k2}\ \ldots\ \phi_{k,V-1}] \tag{2.17}$$

Each real value φ_kv can be interpreted as the logarithm of an odds ratio comparing the probability of the v-th word in the vocabulary to the probability of the last word at index V: φ_kv = log(μ_kv / μ_kV). A value near zero indicates these two words are roughly equal in probability. A large positive value indicates v is more likely, while a large negative value indicates V is more likely. Generally, the larger φ_kv is relative to all other entries in φ_k, the more probable that we will choose word v.

The cumulant function c(φ_k) and the reference measure h(x_n) are:

$$c(\phi_k) = \log\Big(1 + \sum_{v=1}^{V-1} e^{\phi_{kv}}\Big), \qquad h(x_n) = 0 \tag{2.18}$$

Example: Multinomial likelihood

Common form. A multinomial density is specified by two parameters: the fixed total count C and a probability vector μ_k. The random variable x_n then has domain X = {x_n ∈ Z^V_{≥0} : Σ_v x_nv = C}.
That is, this density places probability on each possible vector x_n of V non-negative integers that sum to C:

$$\log p(x_n | \mu_k, C) = \log \frac{C!}{\prod_{v=1}^V x_{nv}!} + \sum_{v=1}^V x_{nv} \log \mu_{kv} \tag{2.19}$$

The parameter vector μ_k ∈ ∆^V defines the probability of each possible word type v.

Natural exponential family form. For a multinomial density over atoms with fixed count C, the sufficient statistic vector s(x_n) and natural parameter φ_k are both (V − 1)-dimensional vectors:

$$s(x_n) = [x_{n1}\ x_{n2}\ \ldots\ x_{n,V-1}], \qquad \phi_k = [\phi_{k1}\ \phi_{k2}\ \ldots\ \phi_{k,V-1}] \tag{2.20}$$

As with the categorical distribution, each real value φ_kv can be interpreted as the logarithm of the odds ratio of choosing word v to choosing word V. The larger φ_kv is relative to other vocabulary words, the more probable that we will choose word v. The cumulant function c(φ_k) and the reference measure h(x_n) are then:

$$c(\phi_k | C) = C \log\Big(1 + \sum_{v=1}^{V-1} e^{\phi_{kv}}\Big), \qquad h(x_n | C) = \log \frac{C!}{\prod_{v=1}^V x_{nv}!} \tag{2.21}$$

Example: Gaussian univariate likelihood with known variance

Common form. Suppose we have a fixed variance v > 0. Then the unknown-mean Gaussian density has one parameter, the mean m_k. The domain of the random variable x_n is all real numbers: X = R, and the density of this real random variable is:

$$\log p(x_n | m_k, v) = -\frac{1}{2} \log 2\pi - \frac{1}{2} \log v - \frac{1}{2} v^{-1} (x_n - m_k)^2 \tag{2.22}$$

Natural exponential family form. The sufficient statistic vector is one-dimensional, as is the natural parameter φ_k ∈ R:

$$s(x_n) = [x_n], \qquad \phi_k = [\phi_{k1}] \tag{2.23}$$

The cumulant function c(φ_k) and reference measure h(x_n) are defined as implicit functions of the fixed variance v:

$$c(\phi_k) = \frac{1}{2} v \phi_{k1}^2, \qquad h(x_n) = -\frac{1}{2} v^{-1} x_n^2 - \frac{1}{2} \log v - \frac{1}{2} \log 2\pi \tag{2.24}$$

Example: Gaussian univariate likelihood with known zero mean

Common form. Suppose we have a fixed mean m = 0. Then the unknown-variance form of the Gaussian density has one parameter for each cluster k: the variance v_k.
The domain of the random variable x_n is all real numbers: X = R, and the density of this real random variable is:

$$\log p(x_n | v_k, m = 0) = -\frac{1}{2} \log 2\pi - \frac{1}{2} \log v_k - \frac{1}{2} v_k^{-1} x_n^2 \tag{2.25}$$

Natural exponential family form. The sufficient statistic vector is one-dimensional, as is the natural parameter φ_k ∈ R:

$$s(x_n) = [x_n^2], \qquad \phi_k = [\phi_{k1}] \tag{2.26}$$

The cumulant function c(φ_k) and reference measure h(x_n) are defined as:

$$c(\phi_k) = -\frac{1}{2} \log(-\phi_k), \qquad h(x_n) = -\frac{1}{2} \log 2\pi \tag{2.27}$$

2.2.2 Properties of the cumulant function

The cumulant function c(φ_k) of any exponential family distribution satisfies several useful properties. These facts are discussed in Proposition 2.1.1 of Sudderth (2006) and also in Wainwright and Jordan (2008), but have long been known in the statistics community.

Property 1: All cumulant functions are convex functions. Convexity implies that the Hessian matrix for this function, ∇²_{φ_k} c(φ_k), must be a positive semi-definite A × A matrix. Additionally, convexity implies that for any two natural parameters φ_ℓ, φ_m ∈ Φ and a scalar ξ ∈ [0, 1], we have:

$$c(\xi \phi_\ell + (1 - \xi) \phi_m) \leq \xi c(\phi_\ell) + (1 - \xi) c(\phi_m) \tag{2.28}$$

The natural parameter space Φ is also a convex space, meaning the interpolation of any two natural parameters is also a valid natural parameter.

Property 2: Derivatives of the cumulant function correspond to moments of the sufficient statistic. That is:

$$\nabla_{\phi_k} c(\phi_k) = \mathbb{E}_{p(x_n | \phi_k)}[s(x_n)] \tag{2.29}$$

$$\nabla^2_{\phi_k} c(\phi_k) = \mathbb{E}_{p(x_n | \phi_k)}\big[s(x_n) s(x_n)^T\big] - \mathbb{E}_{p(x_n | \phi_k)}[s(x_n)]\, \mathbb{E}_{p(x_n | \phi_k)}\big[s(x_n)^T\big]$$

This second result shows that the Hessian of the cumulant function is interpretable as the covariance matrix of the sufficient statistic vector, which agrees with our positive semi-definite result above.

2.2.3 Mean parameterization and Bregman divergences

The natural parameterization of exponential families from Eq. (2.14) is often not the most convenient for practical analysis.
For example, from a given parameter φ_k ∈ Φ it may be difficult to reason intuitively about what data x_n may be likely under the density L(x_n | φ_k), because x_n ∈ X lives in a different space than Φ. Here, we introduce the well-known alternative to the natural parameter φ_k: the mean parameterization, using a mean vector μ_k ∈ M ⊂ R^A. Every exponential family density admits a mean parameterization (Wainwright and Jordan, 2008). This is due to the convexity of the function c(φ_k), which implies (by Legendre duality) a one-to-one invertible transformation between the natural space Φ and the mean space M. We often find the mean parameterization convenient because, by definition, every sufficient statistic s(x_n) lives in the mean space M or its closure (Wainwright and Jordan, 2008). Using the mean parameterization, we can derive a useful “distance” function between any statistic s(x_n) and any mean parameter μ_k. This pseudo-distance (which is not always a proper metric) is called a Bregman divergence (Banerjee et al., 2005). Practical algorithms can work in either the mean parameter space or the natural parameter space, allowing greater computational flexibility. The subsections below introduce the crucial formal ideas behind the mean parameterization.

Formal definition of the mean parameter

Given a specific exponential family likelihood and its corresponding natural parameter φ_k ∈ Φ, there are two complementary formal definitions of the mean parameter μ_k. First, we can define this vector as the expected value of the sufficient statistic given fixed natural parameter φ_k:

$$\mu_k \triangleq \mathbb{E}_{p(x_n | \phi_k)}[s(x_n)] \tag{2.30}$$

This shows that sufficient statistic vectors s(x_n) and mean parameters μ_k live in the same mean parameter space M. This often makes the parameter μ_k easier to interpret, because it occupies the same space as the essential statistics of observed data.
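For the Bernoulli likelihood this first definition can be checked directly: the mean parameter μ = E[s(x_n)] = p(x_n = 1) coincides with the gradient of the cumulant c(φ) = log(1 + e^φ), matching Property 2 above. A minimal numerical check (helper names ours):

```python
import math

def c(phi):
    # Bernoulli cumulant function c(phi) = log(1 + e^phi)
    return math.log(1.0 + math.exp(phi))

def numeric_grad(f, x, eps=1e-6):
    # central finite-difference approximation of f'(x)
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

phi = 0.3
mu = math.exp(phi) / (1.0 + math.exp(phi))   # E[s(x)] = p(x = 1)
grad = numeric_grad(c, phi)
# Eq. (2.30) and Property 2 agree: grad of the cumulant equals mu
```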
The second definition of the mean parameter μ_k is perhaps more computationally intuitive: we define μ_k via a one-to-one, invertible transformation from the natural parameter space Φ to the mean space M. Thus, each natural parameter φ_k has exactly one corresponding μ_k, and vice versa (Wainwright and Jordan, 2008; Banerjee et al., 2005). First, we can define the natural-to-mean transformation in terms of the gradient of the cumulant function:

$$\mu_k = \mu(\phi_k) \triangleq \nabla_{\phi_k} c(\phi_k) \tag{2.31}$$

The reverse mean-to-natural transformation is then

$$\phi_k = \phi(\mu_k) \triangleq \nabla_{\mu_k} \gamma(\mu_k) \tag{2.32}$$

where the conjugate cumulant function γ(·) is the Legendre dual of the cumulant c(·). Formally, this conjugate cumulant function γ(μ_k) maps input vectors from the mean parameter space to the real line: γ(μ_k) : M → R. The function is also convex and defined as:

$$\gamma(\mu_k) \triangleq \mu_k^T \phi(\mu_k) - c(\phi(\mu_k)) \tag{2.33}$$

Every exponential family density with cumulant function c(φ_k) has a corresponding conjugate cumulant function γ(μ_k).

The space M is an open set. For example, for the Bernoulli density the mean space is the interval (0, 1), while for a univariate zero-mean Gaussian the mean space is the open interval of valid variances, (0, +∞). For multivariate zero-mean Gaussians, the mean space is the open set of all possible covariance matrices; that is, all D × D symmetric, positive-definite matrices. Let M^c denote the closure of the mean parameter space M. That is, for our Bernoulli example where M ≜ (0, 1), the closure is M^c = [0, 1]. This closure space is important because the sufficient statistic function always lies in the closure: s(x_n) ∈ M^c ⊂ R^A. For example, the sufficient statistic for a Bernoulli data atom is equal to either 0 or 1, which bookend the interval (0, 1). The sufficient statistic for a multivariate zero-mean Gaussian, which is x_n x_n^T, is a rank-one matrix that is positive semi-definite, and thus lies in the closure of the set of positive-definite matrices.
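These transformations are concrete in the Bernoulli case: μ(φ) is the logistic function, φ(μ) is the log-odds, and working through Eq. (2.33) gives γ(μ) = μ log μ + (1 − μ) log(1 − μ), the negative entropy. A hedged sketch with our own helper names:

```python
import math

def c(phi):
    # Bernoulli cumulant function
    return math.log(1.0 + math.exp(phi))

def mu_of_phi(phi):
    # natural-to-mean map, Eq. (2.31): gradient of c is the logistic function
    return math.exp(phi) / (1.0 + math.exp(phi))

def phi_of_mu(mu):
    # mean-to-natural map, Eq. (2.32): the log-odds
    return math.log(mu / (1.0 - mu))

def gamma(mu):
    # conjugate cumulant, Eq. (2.33)
    return mu * phi_of_mu(mu) - c(phi_of_mu(mu))

mu = 0.3
# the two maps invert each other, and gamma(mu) is the negative entropy
```

At the boundary points μ = 0 and μ = 1 of the closure M^c, γ(μ) must instead be computed by the limiting argument described next.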
Given a boundary parameter m that lies in the closure M^c but not the interior M, we can compute functions like γ(μ) by taking the limit as we move from the interior toward the boundary position m:

$$\gamma(m) = \lim_{\|\epsilon\| \to 0^+} \gamma(m + \epsilon), \qquad m \in \mathcal{M}^c \text{ and } m + \epsilon \in \mathcal{M} \tag{2.34}$$

Any other function that takes input from the interior of the mean parameter space can also be computed this way. See Theorem 3.3 of Wainwright and Jordan (2008) for more details.

Bregman divergences

For any convex function γ(·) : M → R with dual function c(·) : Φ → R, we can compute a non-negative “distance” between two elements of the mean parameter space M via the Bregman divergence function:

$$D_\gamma(\mu_a, \mu_b) \triangleq \gamma(\mu_a) - \gamma(\mu_b) - (\mu_a - \mu_b)^T \phi(\mu_b), \qquad \mu_a, \mu_b \in \mathcal{M} \tag{2.35}$$

The journal article of Banerjee et al. (2005) provides essential coverage of the useful properties of Bregman divergence functions. For our purposes, we emphasize that the output of this function is a non-negative scalar: D_γ(μ_a, μ_b) ≥ 0 for all μ_a, μ_b ∈ M, with equality occurring only if μ_a = μ_b. From the definition above, it is clear this function is asymmetric, with D_γ(μ_a, μ_b) ≠ D_γ(μ_b, μ_a) in general. The Bregman divergence function is useful because it provides a “distance” between a sufficient statistic vector s(x_n) ∈ M^c and a mean parameter μ_k ∈ M. While not a proper distance metric because of its asymmetry, computing a pseudo-distance can still be useful, as we show in later distance-biased initialization algorithms. We emphasize that computing the Bregman divergence for sufficient statistic vectors which lie on the closure of the open set M requires applying the same limiting arguments from Eq. (2.34). Aside from its utility as a pseudo-distance, we can also write our density function over the random variable x_n in terms of the Bregman divergence function. First, we take the definition of D_γ in Eq. (2.35) and replace the term γ(μ_k) with its definition in Eq. (2.33).
After canceling the terms μ_k^T φ(μ_k) with opposite signs, we have:

$$D_\gamma(s(x_n), \mu_k) = \gamma(s(x_n)) + c(\phi(\mu_k)) - s(x_n)^T \phi(\mu_k) \tag{2.36}$$

Using Eq. (2.36), we can write the log probability density function for any exponential family distribution (Eq. (2.14)) in terms of the Bregman divergence:

$$\log p(s(x_n) | \phi_k) = -D_\gamma(s(x_n), \mu(\phi_k)) + \gamma(s(x_n)) + h(x_n) \tag{2.37}$$

This highlights the expressive power of the mean parameter transformation and the corresponding Bregman divergence function.

Example: Bregman divergence for Multinomial likelihood

When x_n is a random variable in the domain of all non-negative integer vectors that sum to C, and we assume x_n has a multinomial likelihood, then we have sufficient statistic s(x_n) = x_n and the corresponding Bregman divergence is:

$$D(x_n, \mu_k) = \sum_{v=1}^V x_{nv} \log \frac{x_{nv}}{\mu_{kv}} \tag{2.38}$$

where the mean vector μ_k must be a non-negative real vector that sums to C: Σ_{v=1}^V μ_kv = C.

Example: Bregman divergence for fixed-variance 1D Gaussian

When x_n is a random variable in the domain of scalar reals, and we assume x_n has a Gaussian likelihood with fixed variance v and mean parameter μ_k, then we have sufficient statistic s(x_n) = x_n and the corresponding Bregman divergence is:

$$D(x_n, \mu_k) = \frac{1}{2} v^{-1} (x_n - \mu_k)^2 \tag{2.39}$$

where the mean vector μ_k is any real scalar. We can recognize this divergence function as a Mahalanobis distance for univariate inputs. This Bregman divergence happens to be symmetric, but symmetry does not hold in general.

Example: Bregman divergence for zero-mean, unknown-variance 1D Gaussian

When x_n is a random variable in the domain of scalar reals, and we assume x_n has a zero-mean Gaussian likelihood with unknown variance μ_k > 0, then we have sufficient statistic s(x_n) = x_n^2 and the corresponding Bregman divergence is:

$$D(s(x_n), \mu_k) = -\frac{1}{2} \log \frac{x_n^2}{\mu_k} + \frac{1}{2} x_n^2 (\mu_k)^{-1} - \frac{1}{2} \tag{2.40}$$

Here, the mean vector μ_k must be a positive real.
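The divergence in Eq. (2.40) behaves as a Bregman divergence should: it is zero exactly when the statistic x_n² equals the variance parameter μ_k, and positive otherwise. A quick numerical sketch (the function name is ours):

```python
import math

def bregman_gauss0(x, mu):
    # D(s(x), mu) for the zero-mean, unknown-variance Gaussian, Eq. (2.40),
    # with sufficient statistic s(x) = x^2 and variance parameter mu > 0
    r = x * x / mu
    return -0.5 * math.log(r) + 0.5 * r - 0.5

# zero when x^2 == mu (e.g. x = 2, mu = 4); positive for any mismatch
```

Writing r = x²/μ, the divergence is (r − log r − 1)/2, which is the familiar non-negative function vanishing only at r = 1.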
Note that the possible edge-case input x_n = 0 technically leads to an infinite divergence because of the log x_n^2 in the first term, but all other inputs lie in the proper open set M.

2.2.4 Conjugate Priors

The sections above define the likelihood density p(x_n|φ_k) = L(x_n|φ_k) and its useful properties. We now turn our attention to the prior density p(φ_k), which is the other required choice to fully specify an observation model. While any density function over the space Φ would be a valid choice for p(φ_k), we constrain our attention to a specific exponential family density P known as the conjugate prior to the chosen likelihood L. We make this choice because (as we will show in later sections) it leads to a posterior over φ_k that comes from the same density family P, with updated parameters. Conjugate priors lead to closed-form expressions for the updated parameters, making algorithms simpler and the mathematical formulas easier to understand. For a textbook introduction, see Sec. 2.4.2 of Bishop (2006).

Log probability density function of the conjugate prior

We will write the log probability density function of the prior distribution P as:

\[
\log p(\phi_k \mid \bar{\nu}, \bar{\tau}) = \bar{\tau}^T \phi_k - \bar{\nu}\, c(\phi_k) - c_P(\bar{\nu}, \bar{\tau})
\tag{2.41}
\]

We can identify this density function as a member of the exponential family, with natural parameter equal to the concatenated vector [τ̄ ν̄] and sufficient statistic equal to the concatenated vector [φ_k −c(φ_k)]. The scalar hyperparameter ν̄ > 0 defines the effective sample size of the prior. Larger values of ν̄ imply a highly concentrated, low-variance distribution. Small values indicate a high-variance distribution. The vector hyperparameter τ̄ ∈ M is understood as the aggregated mean. Dividing this value by the effective sample size gives the prior density's expected value for the mean parameter: E_{p(φ|ν̄,τ̄)}[µ(φ)] = τ̄/ν̄.
The prior cumulant function c_P(ν̄, τ̄) is defined as:

\[
c_P(\bar{\nu}, \bar{\tau}) = \log \int_{\phi_k \in \Phi} e^{\bar{\tau}^T \phi_k - \bar{\nu}\, c(\phi_k)} \, d\phi_k
\tag{2.42}
\]

By definition as an exponential family cumulant function, it is a convex function which can be differentiated to compute the expected values of φ_k and c(φ_k).

Example: Dirichlet prior for Multinomial likelihood

Common form. Traditionally, assuming random variable η_k ∈ Δ^V is Dirichlet distributed with parameter λ ∈ R_+^V leads to the log density function:

\[
\log p(\eta_k \mid \lambda) \triangleq c_{\text{Dir}}(\lambda) + \sum_{v=1}^{V} (\lambda_v - 1) \log \eta_{kv}
\tag{2.43}
\]

where the normalizing constant is defined as:

\[
c_{\text{Dir}}(\lambda_1, \lambda_2, \ldots, \lambda_V) = \log \Gamma\Big(\sum_{v=1}^{V} \lambda_v\Big) - \sum_{v=1}^{V} \log \Gamma(\lambda_v)
\tag{2.44}
\]

Natural conjugate form. Instead, we have a density parameterized by a scalar ν̄ > 0 and a vector τ̄ ∈ R_+^{V−1}, which is constrained so that Σ_v τ̄_v ≤ ν̄. There exists a one-to-one invertible mapping between the common parameter vector λ ∈ R_+^V and these natural parameters:

\[
\lambda_v(\bar{\tau}, \bar{\nu}) =
\begin{cases}
\bar{\tau}_v & \text{if } v < V \\[2pt]
\bar{\nu} - \sum_{w=1}^{V-1} \bar{\tau}_w & \text{if } v = V
\end{cases}
\tag{2.45}
\]

Now, the essential quantities for the natural parameterization of Eq. (2.41) for a Dirichlet density are the natural likelihood parameter φ_k ∈ R^{V−1}, the cumulant function c(φ_k), and the prior cumulant function c_P:

\[
\begin{aligned}
\phi_k &= [\phi_{k1}\; \phi_{k2}\; \ldots\; \phi_{k,V-1}] \\
c(\phi_k) &= \log\Big(1 + \sum_{v=1}^{V-1} e^{\phi_{kv}}\Big) \\
c_P(\bar{\tau}, \bar{\nu}) &= \log \Gamma\Big(\bar{\nu} - \sum_{w=1}^{V-1} \bar{\tau}_w\Big) + \sum_{v=1}^{V-1} \log \Gamma(\bar{\tau}_v) - \log \Gamma(\bar{\nu})
\end{aligned}
\tag{2.46}
\]

As with the natural form of the Multinomial likelihood, we interpret φ_k as a vector containing logarithms of odds ratios. The larger φ_{kv} is relative to other entries in φ_k, the more likely word v will be.
Under this prior density, we have the useful expectations which follow from moment-generating properties of the prior cumulant function:

\[
\begin{aligned}
\mathbb{E}_{p(\phi_k|\bar{\tau},\bar{\nu})}[\phi_{kv}] &= \nabla_{\bar{\tau}_v} c_P(\bar{\tau}, \bar{\nu}) = \psi(\bar{\tau}_v) - \psi\Big(\bar{\nu} - \sum_{w=1}^{V-1} \bar{\tau}_w\Big) \\
\mathbb{E}_{p(\phi_k|\bar{\tau},\bar{\nu})}[c(\phi_k)] &= -\nabla_{\bar{\nu}} c_P(\bar{\tau}, \bar{\nu}) = \psi(\bar{\nu}) - \psi\Big(\bar{\nu} - \sum_{w=1}^{V-1} \bar{\tau}_w\Big)
\end{aligned}
\tag{2.47}
\]

We can also show that under this conjugate prior, we have

\[
\mathbb{E}_{p(\phi_k|\bar{\tau},\bar{\nu})}[\mu_v(\phi_k)] = \frac{\bar{\tau}_v}{\bar{\nu}}
\tag{2.48}
\]

where µ_v(φ_k) is the mean parameter at vocabulary word v. Thus, we interpret the non-negative value τ̄_v as a pseudocount at word v, and ν̄ as the total pseudocount across all words.

Example: Gaussian prior for fixed-variance Gaussian likelihood

Natural parameterization. Let scalar ν̄ > 0 define the effective-sample-size pseudocount and scalar τ̄ define the aggregate mean. Then we have the prior cumulant function, defined as an implicit function of the fixed variance parameter v:

\[
c_P(\bar{\nu}, \bar{\tau}) = \tfrac{1}{2} v^{-1} \bar{\nu}^{-1} \bar{\tau}^2 - \tfrac{1}{2} \log[v \bar{\nu}] + \tfrac{1}{2} \log[2\pi]
\tag{2.49}
\]

Under this prior, we have the expectations:

\[
\begin{aligned}
\mathbb{E}_{p(\phi_k|\bar{\tau},\bar{\nu})}[\phi_k] &= \nabla_{\bar{\tau}} c_P(\bar{\tau}, \bar{\nu}) = v^{-1} \bar{\nu}^{-1} \bar{\tau} \\
\mathbb{E}_{p(\phi_k|\bar{\tau},\bar{\nu})}[c(\phi_k)] &= -\nabla_{\bar{\nu}} c_P(\bar{\tau}, \bar{\nu}) = \tfrac{1}{2} v^{-1} \bar{\nu}^{-2} \bar{\tau}^2 + \tfrac{1}{2} \bar{\nu}^{-1}
\end{aligned}
\tag{2.50}
\]

as well as the expectation of the mean parameter:

\[
\mathbb{E}_{p(\phi_k|\bar{\tau},\bar{\nu})}[\mu(\phi_k)] = \bar{\nu}^{-1} \bar{\tau}
\tag{2.51}
\]

We can thus interpret the value τ̄ as determining the expectation of the mean parameter, while the value ν̄ sets the variance.

2.3 Learning observation model parameters from data

The previous sections have described the generative model for data x and cluster shape parameters φ which conditions on known cluster assignments z. We now review the two fundamental approaches for global parameter inference of φ: point estimation and approximate posterior estimation. First, we emphasize the simplifying role that sufficient statistics of the data x and assignments z can play in any estimation procedure for φ. We then address maximum likelihood point estimation, maximum-a-posteriori (MAP) point estimation, and proper posterior estimation.
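The gradient identities in Eq. (2.50) are easy to sanity-check numerically. The sketch below (an illustrative Python check; the values of v, ν̄, τ̄ are arbitrary choices, not from the thesis) compares central finite differences of c_P from Eq. (2.49) against the closed-form expectations:

```python
import math

def c_P(nu, tau, v):
    """Prior cumulant of Eq. (2.49) for the fixed-variance Gaussian with variance v."""
    return 0.5 * tau**2 / (v * nu) - 0.5 * math.log(v * nu) + 0.5 * math.log(2 * math.pi)

v, nu, tau, eps = 2.0, 3.0, 1.5, 1e-6

# E[phi_k] = d c_P / d tau  (Eq. 2.50, first line)
E_phi = tau / (v * nu)
fd_tau = (c_P(nu, tau + eps, v) - c_P(nu, tau - eps, v)) / (2 * eps)
assert abs(fd_tau - E_phi) < 1e-8

# E[c(phi_k)] = -d c_P / d nu  (Eq. 2.50, second line)
E_c = 0.5 * tau**2 / (v * nu**2) + 0.5 / nu
fd_nu = -(c_P(nu + eps, tau, v) - c_P(nu - eps, tau, v)) / (2 * eps)
assert abs(fd_nu - E_c) < 1e-8
```

The same finite-difference check applies to the digamma identities in Eq. (2.47).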
For an alternative presentation of these concepts, see Ch. 9 of the textbook by Murphy (2012).

2.3.1 Sufficient statistics

Any procedure for estimating φ from data x requires considering the aggregate likelihood for the whole dataset: p(x|z, φ). Under our assumed exponential family likelihood, we can write the aggregate log likelihood for the whole dataset as

\[
\begin{aligned}
\log p(x \mid z, \phi) &= \sum_{k=1}^{K} \sum_{n=1}^{N} \delta_k(z_n) \log p(x_n \mid \phi_k) \\
&= \sum_{k=1}^{K} \sum_{n=1}^{N} \delta_k(z_n) \Big[ \phi_k^T s(x_n) - c(\phi_k) + h(x_n) \Big] \\
&= \sum_{k=1}^{K} \Big[ \phi_k^T S_k(x, z) - N_k(z)\, c(\phi_k) \Big] + \sum_{n=1}^{N} h(x_n)
\end{aligned}
\tag{2.52}
\]

As a function of φ_k, this expression has been greatly simplified. The sum of reference measures h(x_n) is independent of φ_k and thus can be ignored during any estimation of φ_k. The remaining sum over clusters shows that each cluster k may be treated independently. The quantities S_k(x, z) and N_k(z) are sufficient statistics, defined as

\[
S_k(x, z) \triangleq \sum_{n=1}^{N} \delta_k(z_n) s(x_n)
\tag{2.53}
\]

\[
N_k(z) \triangleq \sum_{n=1}^{N} \delta_k(z_n)
\tag{2.54}
\]

We can interpret N_k(z) as the total number of observations assigned to cluster k. Similarly, we can interpret S_k ∈ R^A as the aggregate data statistic across all observations assigned to cluster k. Importantly, both statistics have dimension independent of the number of data atoms N. Evaluation of the likelihood thus need not scale with the number of data atoms N after the statistics have been computed.

2.3.2 Maximum Likelihood (ML) Point Estimation

We now consider the following optimization problem given data x:

\[
\phi_k^* = \arg\max_{\phi_k} \; \log p(x \mid z, \phi)
\tag{2.55}
\]

This can be interpreted as maximizing the likelihood of data x given the point estimates {φ_k}_{k=1}^K.
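The compression offered by Eqs. (2.53) and (2.54) is easy to see concretely. The sketch below (illustrative Python with toy numbers, using the 1D Gaussian case where s(x_n) = x_n) computes the per-cluster statistics in one pass, after which no computation needs to revisit the N raw atoms:

```python
# Toy dataset: N scalar observations with hard assignments to K=2 clusters.
x = [1.0, 1.2, 0.8, 5.0, 5.2]
z = [0, 0, 0, 1, 1]
K = 2

def s(x_n):
    """Sufficient statistic for the fixed-variance 1D Gaussian: s(x_n) = x_n."""
    return x_n

# Summary step: Eqs. (2.53)-(2.54). One pass over the data.
S = [0.0] * K   # S_k = sum of s(x_n) over atoms assigned to cluster k
N = [0] * K     # N_k = count of atoms assigned to cluster k
for x_n, z_n in zip(x, z):
    S[z_n] += s(x_n)
    N[z_n] += 1

assert N == [3, 2]
assert abs(S[0] - 3.0) < 1e-12 and abs(S[1] - 10.2) < 1e-12

# Per-cluster sample means S_k / N_k no longer reference the N raw atoms.
means = [S[k] / N[k] for k in range(K)]
assert abs(means[0] - 1.0) < 1e-12 and abs(means[1] - 5.1) < 1e-12
```

This is the memory pattern exploited throughout the thesis: statistics of fixed size, regardless of N.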
This maximization objective function can be simplified by expanding the form of the likelihood L(x_n|φ_k) and dropping terms constant with respect to the vector φ_k. Since the clusters separate, we may solve for each k independently:

\[
\phi_k^* = \arg\max_{\phi_k} \; \phi_k^T S_k(x, z) - N_k(z)\, c(\phi_k)
\tag{2.56}
\]

Because the cumulant function c(φ_k) is convex, we know that φ_k^T S_k(x, z) − N_k(z) c(φ_k) is a linear function minus a convex function and therefore concave. This implies our maximization problem has a unique solution, which we find by taking derivatives, setting to zero, and solving for φ_k^*:

\[
\begin{aligned}
0 &= \nabla_{\phi_k} \Big[ \phi_k^T S_k(x, z) - N_k(z)\, c(\phi_k) \Big] \\
0 &= S_k(x, z) - N_k(z)\, \mu(\phi_k)
\end{aligned}
\tag{2.57}
\]

Remember that the function µ(φ_k) is defined as the gradient of the cumulant function, and that it is an invertible function whose inverse µ^{-1} is defined as the function φ(·) in Eq. (2.32) from the mean parameter space M to the natural space Φ. Thus, our maximum likelihood point estimate is:

\[
\phi_k^* = \phi\!\left( \frac{S_k(x, z)}{N_k(z)} \right)
\tag{2.58}
\]

Mean parameter interpretation. By replacing the log density function log p(x_n|φ_k) with the equivalent density function parameterized by a Bregman divergence in Eq. (2.37) and dropping constant terms, we have an equivalent optimization problem in terms of the mean parameter µ_k^*:

\[
\mu_k^* = \arg\min_{\mu_k \in \mathcal{M}} \sum_{n=1}^{N} \delta_k(z_n)\, D_\gamma\big(s(x_n), \mu_k\big)
\tag{2.59}
\]

This optimization problem over the mean parameters has a unique solution via Theorem 1 of Agarwal and Daumé III (2010):

\[
\mu_k^* = \frac{S_k(x, z)}{N_k(z)}
\tag{2.60}
\]

The optimal mean parameter is thus always the sample mean of the sufficient statistics of the observations assigned to cluster k, no matter what exponential family density is used for the observation model likelihood (Gaussian, Multinomial, Bernoulli, etc.). The optimal natural parameter φ_k^* is found by mapping the optimal mean µ_k^* into the natural parameter space: φ_k^* = φ(µ_k^*). Substituting µ_k^* from Eq. (2.60) into φ(µ_k^*), we find we sensibly recover the formula for φ_k^* from Eq. (2.58).
2.3.3 Maximum a posteriori (MAP) point estimation

We consider now the problem of point estimation of the parameter φ_k for cluster k under the joint distribution of the likelihood L and prior P:

\[
\phi_k^* = \arg\max_{\phi_k} \; \log p(x \mid z, \phi) + \log p(\phi)
\tag{2.61}
\]

This is equivalent to:

\[
\phi_k^* = \arg\max_{\phi_k} \; \phi_k^T \big( S_k(x, z) + \bar{\tau} \big) - \big( N_k(z) + \bar{\nu} \big)\, c(\phi_k)
\tag{2.62}
\]

Mean parameter interpretation. By rewriting the densities in terms of Bregman divergences and simplifying, we find the equivalent problem in mean space reduces to:

\[
\mu_k^* = \arg\min_{\mu_k \in \mathcal{M}} \; \Big[ \bar{\nu}\, D_\gamma\big(\tfrac{\bar{\tau}}{\bar{\nu}}, \mu_k\big) + \sum_{n=1}^{N} \delta_k(z_n)\, D_\gamma\big(s(x_n), \mu_k\big) \Big]
\tag{2.63}
\]

This MAP optimization problem has a unique solution via Theorem 2 of Agarwal and Daumé III (2010):

\[
\mu_k^* = \frac{S_k(x, z) + \bar{\tau}}{N_k(z) + \bar{\nu}}
\tag{2.64}
\]

We can interpret µ_k^* as a weighted sample mean, though now the sample includes the observed dataset {x_n}_{n=1}^N as well as the prior "pseudo-dataset", which has an effective size of ν̄ atoms and an aggregate sufficient statistic value of τ̄.

Again, we can find the corresponding natural parameter estimate via the mean-to-natural transformation function φ_k^* = φ(µ_k^*). Thus, we have the MAP estimate:

\[
\phi_k^* = \phi\!\left( \frac{S_k(x, z) + \bar{\tau}}{N_k(z) + \bar{\nu}} \right)
\tag{2.65}
\]

We emphasize that this is the MAP estimate under the joint density log p(x, φ|z). MAP estimates are well-known to depend strongly on the choice of basis (MacKay, 1998). If some parameterization of the prior other than the natural conjugate form log p(φ|τ̄, ν̄) is used, the corresponding MAP estimate will be different.

2.3.4 Posterior estimation

Consider now the problem of determining the posterior distribution p(φ|x, z). We can accomplish this by inspecting the form of the joint distribution:

\[
\begin{aligned}
\log p(\phi, x \mid z) &= \log p(\phi) + \log p(x \mid z, \phi) \\
&= \text{const} + \sum_{k=1}^{K} \phi_k^T \big( S_k(x, z) + \bar{\tau} \big) - c(\phi_k) \big( N_k(z) + \bar{\nu} \big)
\end{aligned}
\tag{2.66}
\]

After gathering all terms constant with respect to φ_k, we find the remaining terms have the same functional form as the prior P over φ_k.
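Eq. (2.64) says the MAP mean estimate is the ML sample mean shrunk toward the prior mean τ̄/ν̄, with the prior acting as ν̄ pseudo-observations. A minimal Python illustration for the 1D Gaussian case, s(x_n) = x_n (the toy numbers below are illustrative choices):

```python
# Toy cluster: three atoms assigned to cluster k, with s(x_n) = x_n.
x_k = [4.0, 5.0, 6.0]
S_k = sum(x_k)        # aggregate sufficient statistic, Eq. (2.53)
N_k = len(x_k)        # count statistic, Eq. (2.54)

# Prior pseudo-dataset: nu_bar pseudo-atoms with aggregate statistic tau_bar.
tau_bar, nu_bar = 0.0, 2.0    # prior mean tau_bar / nu_bar = 0.0

mu_ml = S_k / N_k                            # Eq. (2.60): sample mean
mu_map = (S_k + tau_bar) / (N_k + nu_bar)    # Eq. (2.64): shrunk toward prior

assert mu_ml == 5.0
assert mu_map == 3.0          # pulled from 5.0 toward the prior mean 0.0
# The MAP estimate always lies between the prior mean and the ML estimate.
prior_mean = tau_bar / nu_bar
assert min(prior_mean, mu_ml) <= mu_map <= max(prior_mean, mu_ml)
```

As N_k grows, the data term dominates and the MAP estimate approaches the ML estimate.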
Our chosen conjugacy between the prior P and likelihood L guarantees that the posterior also belongs to the P density family, with updated parameters:

\[
\begin{aligned}
p(\phi_k \mid x, z) &= P(\phi_k \mid \hat{\tau}_k, \hat{\nu}_k) \\
\hat{\tau}_k &= S_k(x, z) + \bar{\tau} \\
\hat{\nu}_k &= N_k(z) + \bar{\nu}
\end{aligned}
\tag{2.67}
\]

The posterior distribution is specified completely by the two parameters τ̂_k and ν̂_k. Under our assumed conjugate exponential family observation model, the algorithmic cost of estimating the full posterior is no greater than the cost of the MAP point estimates found earlier.

We further emphasize that this posterior is exact. If both data x and cluster assignments z are fully observed, we can find the posterior p(φ_k|x, z) in closed form, with the sufficient statistics N_k(z), S_k(x, z) representing all we need to know from x, z. Next, we examine algorithms for the case where only the dataset x is observed, and the assignments z, shape parameters φ, and frequencies π^G must be learned from data.

2.4 Point estimation algorithms for finite mixture models

Here, we consider the simplest possible probabilistic clustering model in Fig. 1.1, a finite mixture model with K clusters. The allocation generative process is specified by a single global frequency vector π^G, which we will in this section treat as the mean parameter transformation of a random variable ϕ ∈ R^{K−1} which has a (natural-form) Dirichlet distribution with effective sample size γ and expected mean [1/K ... 1/K]:

\[
\log p(\varphi \mid \gamma) = \text{const} + \frac{\gamma}{K} \sum_{k=1}^{K-1} \varphi_k - \gamma \log\Big(1 + \sum_{\ell=1}^{K-1} e^{\varphi_\ell}\Big)
\tag{2.68}
\]

\[
\pi_k(\varphi) \triangleq \frac{e^{\varphi_k}}{1 + \sum_{\ell=1}^{K-1} e^{\varphi_\ell}}
\tag{2.69}
\]

Under this parameterization, we have

\[
\mathbb{E}[\pi_k(\varphi)] = \frac{1}{K}
\tag{2.70}
\]

Given the fixed value for the frequency vector π(ϕ), we generate each data atom assignment z_n independently:

\[
z_n \mid \pi(\varphi) \sim \text{Cat}_K(\pi_1, \pi_2, \ldots, \pi_K)
\tag{2.71}
\]

Then, given the observed assignments z, we have our standard conjugate-exponential-family observation model:

\[
\phi_k \sim P(\phi_k \mid \bar{\tau}, \bar{\nu}), \qquad x_n \mid z_n, \phi \sim L(x_n \mid \phi_{z_n})
\tag{2.72}
\]

Our first goal will be an algorithm for point estimation for each of the hidden global random variables ϕ, φ and the local assignments z. We will later extend this to a full variational method for approximate posteriors.

Algorithm 2.1 Bregman k-means for point estimation of finite mixture model

Input:
  {x_n}_{n=1}^N : dataset with N exchangeable observations
  γ > 0 : hyperparameter defining the allocation-model prior p(ϕ|γ)
  τ̄, ν̄ : hyperparameters defining the prior P(φ_k|τ̄, ν̄)
  {µ_k}_{k=1}^K : initial point estimates of observation-model mean parameters, µ_k ∈ M
  π : initial point estimate of allocation-model frequency vector
Output:
  {z_n}_{n=1}^N : point estimates of hard assignments
  {µ_k}_{k=1}^K : point estimates of observation-model mean parameters
  π : point estimate of allocation-model frequency vector

 1: function BregmanKMeans(x, τ̄, ν̄, γ, {µ_k}_{k=1}^K, π)
 2:   while not converged do
 3:     for n ∈ 1, 2, ..., N do                      ▷ Local parameter update step
 4:       for k ∈ 1, 2, ..., K do
 5:         W_nk = log π_k − D(s(x_n), µ_k)          ▷ Log posterior weight of cluster k
 6:       z_n = arg max_{k ∈ {1,2,...,K}} W_nk
 7:     for k ∈ 1, 2, ..., K do                      ▷ Summary step
 8:       S_k(x, z) = Σ_{n=1}^N δ_k(z_n) s(x_n)
 9:       N_k(z) = Σ_{n=1}^N δ_k(z_n)
10:     for k ∈ 1, 2, ..., K do                      ▷ Global parameter update step
11:       µ_k = (S_k(x, z) + τ̄) / (N_k(z) + ν̄)
12:       π_k = (N_k(z) + γ/K) / (N + γ)
13:   return z, µ, π

Block-coordinate ascent point estimation algorithm for mixture models. Sometimes called hard-assignment expectation-maximization (EM).

2.4.1 Bregman k-means point estimation for finite mixture model

Our goal is finding point estimates of ϕ, φ, z. We set up a maximum a-posteriori optimization problem:

\[
\begin{aligned}
\max_{\pi, \mu, z} \quad & \mathcal{J}(\pi, \mu, z, x) \\
\text{subject to} \quad & \pi \in \Delta^K \\
& \mu_k \in \mathcal{M}, \;\; \text{for } k = 1, 2, \ldots, K \\
& z_n \in \{1, 2, \ldots, K\}, \;\; \text{for } n = 1, 2, \ldots, N
\end{aligned}
\tag{2.73}
\]

where the objective function we wish to maximize is the joint log probability density over all the variables of interest:

\[
\mathcal{J}(x, \pi, \mu, z) = \log p(\varphi(\pi)) + \sum_{k=1}^{K} \log p(\phi_k(\mu_k)) + \sum_{k=1}^{K} \sum_{n=1}^{N} \delta_k(z_n) \Big[ \log \pi_k + \log p(x_n \mid \phi_k(\mu_k)) \Big]
\tag{2.74}
\]

Crucially, we emphasize that the probability density functions used here are over the natural-form random variables ϕ and φ. However, the variables we instantiate throughout the algorithm are the corresponding mean parameters π (for frequency vectors) and µ (for cluster shapes).

An iterative optimization algorithm is given in Alg. 2.1, based on a similar algorithm from Banerjee et al. (2005). This algorithm operates in the mean-parameter space, but could equivalently be written in the natural-parameter space using the appropriate transformation functions φ(·) : M → Φ from Eq. (2.32) and µ(·) : Φ → M from Eq. (2.31). We can interpret this algorithm as an extension of the popular k-means algorithm (Lloyd, 1982; Jain, 2010) to proper mixture models with conjugate-exponential-family observation models. Sensibly, each iteration of this method has cost that is linear in the desired number of clusters K and the total number of atoms to cluster N.

Local step: point estimates of assignments z. Given fixed means {µ_k}_{k=1}^K for each cluster as well as the frequency vector π, we wish to maximize our objective J(x, π, µ, z) with respect to z. Rewriting the objective as a function of z_n, we have:

\[
\begin{aligned}
\mathcal{J}_n(x_n, z_n, \mu, \pi) &= \sum_{k=1}^{K} \delta_k(z_n) W_{nk}(\mu, \pi) \\
W_{nk}(\mu, \pi) &\triangleq \log p(x_n \mid \phi(\mu_k)) + \log \pi_k \\
&= -D(s(x_n), \mu_k) + \log \pi_k \\
&= -c(\phi(\mu_k)) + s(x_n)^T \phi(\mu_k) + \log \pi_k
\end{aligned}
\tag{2.75}
\]

Here, we can interpret W_nk as the log posterior weight for assigning cluster k to data atom n. Larger values indicate that cluster k provides a better explanation for atom n. Note the last definition of W_nk, in terms of the cumulant c(φ_k) and the natural parameter, uses the Bregman divergence definition from Eq. (2.36), simplified by dropping terms that are constant with respect to cluster index k.

Applying Lagrange multiplier methods to find the optimal assignment while obeying the hard constraint z_n ∈ {1, 2, ..., K} leads to the update

\[
z_n^* = \arg\max_{k \in \{1, 2, \ldots, K\}} W_{nk}
\tag{2.76}
\]

We can interpret this simply as choosing the cluster label k ∈ {1, 2, ..., K} which has maximum posterior probability under the current cluster means µ and frequencies π. This update is guaranteed to monotonically increase the objective J.

Global step: MAP point estimates of frequency vector π. Applying the standard arguments for natural-parameter maximum-a-posteriori estimation from Sec. 2.3.3, we have:

\[
\pi_k^* = \frac{N_k + \frac{\gamma}{K}}{N + \gamma}, \qquad \text{where } \pi^* = \arg\max_{\pi \in \Delta^K} \; \log p(\varphi(\pi)) + \log p(z \mid \pi)
\tag{2.77}
\]

It is important to realize that this natural-parameter MAP estimate exists for any value of γ ≥ 0. In contrast, the equivalent mean-parameter optimization problem, which treats π ∈ Δ^K directly as a random variable, does not always have a MAP solution. In particular, the MAP estimate for this parameterization is:

\[
\pi_k^* = \frac{N_k + \frac{\gamma}{K} - 1}{N + \gamma - K}, \qquad \pi^* = \arg\max_{\pi \in \Delta^K} \; \log p(\pi) + \log p(z \mid \pi)
\tag{2.78}
\]

This solution only exists when N_k + γ/K > 1. Otherwise, the MAP is undefined.

2.4.2 Bregman k-means++ initialization of point estimates for global parameters

The point estimation algorithm in Alg. 2.1 requires as input both initial cluster means {µ_k}_{k=1}^K, as well as an initial cluster frequency vector π. Because the overall objective J is non-convex, selecting a smart initialization is important to avoid bad local optima and reach a high-quality solution. Our inspiration for a smart initialization comes from the k-means++ algorithm introduced by Arthur and Vassilvitskii (2007), which considers the simpler k-means optimization problem using Euclidean distances instead of general-purpose Bregman divergences.
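To make Alg. 2.1 concrete, here is a compact Python sketch for the 1D fixed-variance Gaussian case, where s(x_n) = x_n, D(x_n, µ_k) = (x_n − µ_k)²/(2v), and the global updates follow lines 11-12 of the algorithm. The hyperparameter values and toy data are illustrative choices, not from the thesis:

```python
import math

def bregman_kmeans(x, mu, v=1.0, tau_bar=0.0, nu_bar=1.0, gamma=1.0, n_iters=50):
    """Alg. 2.1 specialized to a fixed-variance 1D Gaussian likelihood."""
    N, K = len(x), len(mu)
    pi = [1.0 / K] * K   # initial uniform frequency estimate
    z = [0] * N
    for _ in range(n_iters):
        # Local step: hard-assign each atom to its highest-weight cluster (Eq. 2.76).
        for n in range(N):
            weights = [math.log(pi[k]) - (x[n] - mu[k])**2 / (2 * v)
                       for k in range(K)]
            z[n] = max(range(K), key=lambda k: weights[k])
        # Summary step: sufficient statistics S_k, N_k (Eqs. 2.53-2.54).
        S, Nk = [0.0] * K, [0] * K
        for n in range(N):
            S[z[n]] += x[n]
            Nk[z[n]] += 1
        # Global step: MAP updates for means (Eq. 2.64) and frequencies (Eq. 2.77).
        mu = [(S[k] + tau_bar) / (Nk[k] + nu_bar) for k in range(K)]
        pi = [(Nk[k] + gamma / K) / (N + gamma) for k in range(K)]
    return z, mu, pi

# Two well-separated toy clusters; initialize means near each group.
x = [0.0, 0.2, -0.1, 10.0, 10.3, 9.8]
z, mu, pi = bregman_kmeans(x, mu=[1.0, 9.0])
assert z == [0, 0, 0, 1, 1, 1]
assert abs(mu[0] - 0.1 / 4.0) < 1e-9    # (0.0 + 0.2 - 0.1 + tau_bar) / (3 + nu_bar)
assert abs(mu[1] - 30.1 / 4.0) < 1e-9
```

Note that each iteration touches the data once to build S_k and N_k, matching the linear O(NK) cost claimed for the algorithm.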
The k-means++ algorithm first selects K distinct data atoms from the dataset x, and then uses the standard update for cluster mean parameters given a single observation to initialize µ_k for each of the K clusters. The selection of the K chosen data atoms is done in a sequential, distance-biased fashion. The first cluster mean µ_1 is formed from a single data atom chosen uniformly at random. Then, the atom for each successive cluster k is chosen from the remaining atoms with probability proportional to the Euclidean distance between the atom and the closest previously-chosen mean. This distance-biased random sampling leads to choosing cluster means µ_k which are far from one another and, if K is large enough, will with high probability adequately cover the space of all observed data x. Formally, Arthur and Vassilvitskii (2007) proved that the k-means++ procedure delivers an initialization µ that is already within a multiplicative approximation factor of the optimal k-means cost function.

Algorithm 2.2 Bregman k-means++ initialization for cluster mean point estimates

Input:
  {x_n}_{n=1}^N : dataset with N exchangeable observations
  τ̄, ν̄ : hyperparameters defining the prior P(φ_k|τ̄, ν̄)
Output:
  µ : initial point estimates of the mean parameters

 1: function InitClusterMeansViaBregmanSamples(x, τ̄, ν̄)
 2:   n_1 ∼ UniformRandomSample({1, 2, ..., N})
 3:   µ_1 ← (s(x_{n_1}) + τ̄) / (1 + ν̄)
 4:   for k ∈ 2, ..., K do
 5:     for n ∈ 1, 2, ..., N do
 6:       p_n = min_{ℓ ∈ {1,2,...,k−1}} D( (s(x_n) + τ̄)/(1 + ν̄), µ_ℓ )
 7:     n_k ∼ Cat_N(p_1, ..., p_N)
 8:     µ_k ← (s(x_{n_k}) + τ̄) / (1 + ν̄)
 9:   return {µ_k}_{k=1}^K

Initialization of cluster means µ_k using a random sampling scheme based on Bregman divergences. Under some mild assumptions, Ackermann and Blömer (2010) show that this procedure will deliver a solution with guaranteed multiplicative approximation ratio to the optimal objective score.
Ackermann and Blömer (2010) suggest extending the k-means++ initialization from Euclidean distances to all Bregman divergences, and thus to all possible observation models in our probabilistic clustering framework. Ackermann and Blömer (2010)'s procedure offers the same appealing guarantees: for some Bregman divergences, the cost of the initial clustering is guaranteed to be within some constant factor of the optimal cost.

Our initialization algorithm is given in Alg. 2.2. This sequential sampling procedure selects a single data atom to represent each cluster in a distance-biased way. At each round, we sample each remaining atom with probability proportional to a divergence whose first argument is the smoothed-mean estimate for the given atom, and whose second argument is that data atom's closest previously-chosen mean. By using smoothing when selecting data, we are able both to compute divergence values exactly (without taking limits) and to be sure we are choosing mean vectors that are far apart, not just data atoms. By choosing each successive cluster mean to be far from the previously-chosen means, we hope to eventually represent all latent clusters in the dataset. Once K means are chosen, we proceed with Alg. 2.1 until either convergence occurs or a maximum budget of iterations is reached.

2.5 Posterior inference via variational optimization

The point estimation procedure in Alg. 2.1 is simpler to understand and implement than more sophisticated methods. However, the maximum likelihood or maximum-a-posteriori approach has one key weakness when training probabilistic clustering models: the maximum-likelihood-based objective function J(x, z, π, φ) cannot be used for model selection. That is, we cannot select the right number of clusters K using this objective function.
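A Python sketch of Alg. 2.2 for the fixed-variance 1D Gaussian case, where D(a, b) = (a − b)²/(2v) and the smoothed mean for a single atom is (s(x_n) + τ̄)/(1 + ν̄). The toy data below use duplicated atoms so the assertion is deterministic regardless of which atom is drawn first; all numeric choices are illustrative:

```python
import random

def bregman_kmeanspp_init(x, K, tau_bar=0.0, nu_bar=1.0, v=1.0, seed=0):
    """Alg. 2.2: distance-biased selection of K initial cluster means."""
    rng = random.Random(seed)
    div = lambda a, b: (a - b)**2 / (2 * v)              # Gaussian Bregman divergence
    smooth = lambda x_n: (x_n + tau_bar) / (1 + nu_bar)  # smoothed mean for one atom
    mu = [smooth(rng.choice(x))]                         # first mean: uniform draw
    while len(mu) < K:
        # Weight each atom by the divergence to its closest already-chosen mean.
        p = [min(div(smooth(x_n), m) for m in mu) for x_n in x]
        n_k = rng.choices(range(len(x)), weights=p)[0]
        mu.append(smooth(x[n_k]))
    return mu

# Two well-separated groups of identical atoms. Whichever group supplies the
# first mean, all its atoms then get weight zero, so the second mean must come
# from the other group: the result is always {0.0/2, 8.0/2} = {0.0, 4.0}.
x = [0.0, 0.0, 0.0, 8.0, 8.0, 8.0]
mu = bregman_kmeanspp_init(x, K=2)
assert sorted(mu) == [0.0, 4.0]
```

With real data the chosen atoms are random, but the bias toward large divergences makes the initial means spread across the latent clusters, as the text describes.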
This is because of the inherent over-fitting property of maximum-likelihood-based approaches: for every solution with exactly K clusters, there exists a solution with K + 1 clusters with a larger objective function score.

Beal (2003) covers this model selection issue extensively in his published dissertation, and motivates a more sophisticated approach: variational optimization of an approximate posterior distribution. Rather than using maximum likelihood approaches which condition on random variables like π or φ which grow in size with K, these variational approaches use an objective based on the marginal likelihood log p(x), thus integrating away all random variables which scale with K. Using the marginal likelihood is an effective, Bayesian way to do model selection (Jefferys and Berger, 1992; Rasmussen and Ghahramani, 2001).

In this section, we will walk through defining the optimization problem, work out the closed-form block-coordinate ascent updates, and give an algorithm for applying this technique to the same finite mixture model we pursued earlier, to aid side-by-side comparison of algorithmic complexity. While perhaps more complex to understand, we will show that this approach has the same order-of-magnitude runtime cost as the point-estimation algorithm. Thus, we gain the benefit of improved model selection without substantial additional computational cost.

2.5.1 Mean-field approximate posterior density estimation

Given a dataset x, our goal is to estimate a joint posterior p(φ, π, z|x) over three variables: the global cluster appearance probabilities π, the global shape parameters φ, and the local assignment variables z. Finding this posterior directly is intractable. Suppose the assignments z use all K distinct cluster labels.
Then even representing the marginal posterior of the assignment variables p(z|x) would require memory and time that scale as O(K^N), since we would need to enumerate the probability of each possible joint configuration of the variables {z_n}_{n=1}^N and there are K^N possible configurations.

Instead, variational inference casts approximate posterior inference as an optimization problem (Wainwright and Jordan, 2008). This approach proceeds in three steps, which are detailed in the sections below. First, we define a simplified family of probability distributions for our variables of interest φ, π and z. This family of distributions is parameterized by several free variables which control their moments. Our goal is to find the values of these free variables. Next, we define an optimization problem whose objective is to find the free variables under which the approximate posterior is as close as possible to the true, intractable posterior, while still remaining in the specified simpler family. Formally, we define this "closeness" via the Kullback-Leibler (KL) divergence, which is the natural measure of distance between two probability distributions. Third, we define an algorithm which, given input data x, delivers the free parameters that solve our optimization problem.

Let q(φ, π, z) denote our approximate posterior. As previously discussed, without any simplifying assumptions even enumerating all possible posterior values is infeasible for all but the smallest datasets. To make our optimization problem computationally feasible, we assume that each of the assignments z_n is independent of the others and each of the clusters φ_k is independent of the others. That is, we can factorize the approximate density as

\[
q(\phi, \pi, z) = \prod_{k=1}^{K} q(\phi_k) \cdot q(\pi) \cdot \prod_{n=1}^{N} q(z_n)
\tag{2.79}
\]

This independence is sometimes called a mean-field assumption, after methods from statistical physics (Parisi, 1988).
We further assume each factor of the approximate posterior has a density that belongs to the exponential family, whose form naturally mimics the generative model when appropriate. Thus, the full parameterization is:

\[
q(\phi) = \prod_{k=1}^{K} P(\phi_k \mid \hat{\tau}_k, \hat{\nu}_k)
\tag{2.80}
\]

\[
q(\pi) = \text{Dir}_K(\pi \mid \hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_K)
\tag{2.81}
\]

\[
q(z) = \prod_{n=1}^{N} \text{Cat}_K(z_n \mid \hat{r}_{n1}, \ldots, \hat{r}_{nK})
\tag{2.82}
\]

Under this chosen factorization, the variables ν̂, τ̂, θ̂, r̂ are the free parameters of our approximate posterior. The goal of variational optimization is to find specific values of these free parameters that make q(π, φ, z) a good approximation to the true posterior. We always denote free parameters with hats to make clear which variables are instantiated and optimized by the algorithm.

Approximate posterior q(π) for global cluster frequencies

The vector π ∈ Δ^K has an approximate posterior we assume has Dirichlet form, just like its prior. The free parameter vector θ̂ has K entries that are all non-negative: θ̂_k ≥ 0. Under our assumed posterior q(π), we have the useful expectations:

\[
\mathbb{E}_q[\pi_k] = \frac{\hat{\theta}_k}{\sum_{\ell=1}^{K} \hat{\theta}_\ell}, \qquad
\mathbb{E}_q[\log \pi_k] = \psi(\hat{\theta}_k) - \psi\Big(\sum_{\ell=1}^{K} \hat{\theta}_\ell\Big)
\tag{2.83}
\]

where ψ(·) is the digamma function, defined as the first derivative of the log gamma function.

Approximate posterior q(φ_k) for global cluster shape

Each cluster k is given an independent posterior factor q(φ_k) for its shape parameter φ_k. Following the analysis for the exact posterior in Sec. 2.1.3, we assume this factor comes from the conjugate prior family P. The factor has two free parameters: pseudo-count ν̂_k > 0 and vector parameter τ̂_k ∈ M which defines the cluster shape.

Local assignment factor q(z_n)

Under the generative model for the finite mixture, the assigned cluster label z_n for observation n could use any of the K possible global cluster indices. Thus, for the posterior we need a categorical distribution over K possible discrete choices. We define the free parameter vector r̂_n to have K entries that are non-negative and sum to one.
We can interpret r̂_nk as the posterior responsibility that cluster k has for data atom n. Values near zero indicate that cluster k does not explain the data well, while values closer to one indicate a good explanation. Under this approximate posterior, we have the useful expectation E_q[δ_k(z_n)] = r̂_nk.

2.5.2 Evidence lower-bound objective function

Given the assumed factorization of q(π, φ, z) above, we now set up an optimization problem over our free parameters ν̂, τ̂, θ̂, r̂. The goal is to minimize the distance between the approximate posterior q(π, φ, z) and the true posterior p(π, φ, z|x). We measure this distance via the Kullback-Leibler divergence, which we formally define below. We then explain the derivation of an optimization problem that minimizes the KL divergence between the approximate posterior q(π, φ, z) and the true posterior p(π, φ, z|x).

Kullback-Leibler (KL) divergence functions

Discrete distribution KL divergence. Consider a discrete random variable Y with K possible outcomes, each indexed by an integer in the set {1, 2, ..., K}. Let there be two possible distributions over the variable Y, one with parameter vector P ∈ Δ^K and the other with parameter Q ∈ Δ^K. Under the first distribution, we have p(Y = k) = P_k, while under the second we have p(Y = k) = Q_k. Then, the KL divergence from vector Q to vector P is:

\[
\text{KL}(Q \,\|\, P) \triangleq \sum_{k=1}^{K} Q_k \log \frac{Q_k}{P_k}, \qquad Q \in \Delta^K, \; P \in \Delta^K
\tag{2.84}
\]

For discrete distributions, KL(Q‖P) ≥ 0 always, with equality only if Q_k = P_k for all outcomes k ∈ {1, 2, ..., K}. This divergence is asymmetric: KL(Q‖P) ≠ KL(P‖Q).

Continuous distribution KL divergence. For a continuous random variable such as φ_k, consider two alternative distributions p(φ_k) and q(φ_k).
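The discrete KL divergence of Eq. (2.84) and its key properties (non-negativity, asymmetry, zero only for identical distributions) take a few lines of Python to illustrate; the probability vectors below are arbitrary examples:

```python
import math

def kl_discrete(Q, P):
    """KL(Q || P) of Eq. (2.84), with the usual convention 0 log 0 = 0."""
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)

Q = [0.5, 0.3, 0.2]
P = [0.6, 0.3, 0.1]

assert kl_discrete(Q, Q) == 0.0                   # zero iff the arguments match
assert kl_discrete(Q, P) > 0.0                    # non-negative in general
assert kl_discrete(Q, P) != kl_discrete(P, Q)     # asymmetric
```

The continuous case of Eq. (2.85) replaces the sum with an integral but keeps the same properties.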
Assuming that each of these provides non-zero probability to all elements φ_k ∈ Φ, we define the KL divergence as:

\[
\text{KL}(q \,\|\, p) \triangleq \int_{\phi_k \in \Phi} q(\phi_k) \log \frac{q(\phi_k)}{p(\phi_k)} \, d\phi_k
\tag{2.85}
\]

For distributions over continuous random variables, the KL divergence is always non-negative: KL(q‖p) ≥ 0, with equality only if q and p represent the same probability density function.

KL divergence for our approximate posterior. The KL divergence between a specific member of our approximate posterior family q(π, φ, z) and the true posterior p(π, φ, z|x) is given by

\[
\text{KL}\big(q(\pi, \phi, z) \,\|\, p(\pi, \phi, z \mid x)\big) \triangleq
\int_{\pi \in \Delta^K} \int_{\phi_1 \in \Phi} \!\cdots\! \int_{\phi_K \in \Phi} \,
\sum_{z_1=1}^{K} \cdots \sum_{z_N=1}^{K}
q(\pi, z, \phi) \log \frac{q(\pi, z, \phi)}{p(\pi, z, \phi \mid x)} \; d\phi_1 \cdots d\phi_K \, d\pi
\tag{2.86}
\]

Evaluating the above function is not possible, because we do not have a closed-form expression for the true posterior density function p(π, z, φ|x). We can alternatively write the KL divergence as an expectation with respect to the approximate posterior q(π, φ, z):

\[
\text{KL}\big(q(\pi, \phi, z) \,\|\, p(\pi, \phi, z \mid x)\big) = \mathbb{E}_{q(\pi, \phi, z)}\left[ \log \frac{q(\pi, \phi, z)}{p(\pi, \phi, z \mid x)} \right]
\tag{2.87}
\]

This does not make evaluation any easier, but it certainly makes the mathematical notation easier to read. Using Bayes' rule, we can rewrite the (intractable) posterior as the joint probability of all random variables (including data x) divided by the marginal probability p(x):

\[
p(\pi, z, \phi \mid x) = \frac{p(\pi, \phi, z, x)}{p(x)}
\tag{2.88}
\]

Our generative model specifies the functional form of the numerator. So, it is only the denominator p(x) that we cannot compute. Substituting into our KL divergence expression, we have:

\[
\begin{aligned}
\text{KL}\big(q(\pi, \phi, z) \,\|\, p(\pi, \phi, z \mid x)\big) &= \mathbb{E}_{q(\pi, \phi, z)}\left[ \log \frac{q(\pi, \phi, z)\, p(x)}{p(\pi, \phi, z, x)} \right] \\
&= \mathbb{E}_{q(\pi, \phi, z)}\left[ \log \frac{q(\pi, \phi, z)}{p(\pi, \phi, z, x)} \right] + \log p(x)
\end{aligned}
\tag{2.89}
\]

where in the second line we have used the fact that p(x) is a function independent of π, φ and z to move its term outside the expectation.
We can always compute the first additive term of this KL divergence between our approximate posterior and the true posterior, while the second is a constant independent of the approximate density q(π, φ, z).

Minimizing KL divergence from approximate posterior to true posterior

Our training goal is to find the specific approximate posterior density q(π, φ, z) that is nearest (in the sense of KL divergence) to the true posterior:

\[
\arg\min_{\hat{\tau}, \hat{\theta}, \hat{\nu}, \hat{r}} \; \text{KL}\big( q(\pi, \phi, z \mid \hat{\nu}, \hat{\tau}, \hat{\theta}, \hat{r}) \,\|\, p(\pi, \phi, z \mid x) \big)
\tag{2.90}
\]

Using the decomposition of the KL divergence from Eq. (2.89), we can derive an equivalent maximization problem:

\[
\arg\max_{\hat{\tau}, \hat{\theta}, \hat{\nu}, \hat{r}} \; \mathcal{L}(x, \hat{\nu}, \hat{\tau}, \hat{\theta}, \hat{r})
\tag{2.91}
\]

The objective function L is defined as the difference between the log marginal probability of the data log p(x) (which is a constant with respect to our free parameters) and the earlier KL divergence:

\[
\begin{aligned}
\mathcal{L}(x, \hat{\nu}, \hat{\tau}, \hat{\theta}, \hat{r}) &\triangleq \log p(x) - \text{KL}\big( q(\pi, \phi, z \mid \hat{\nu}, \hat{\tau}, \hat{\theta}, \hat{r}) \,\|\, p(\pi, \phi, z \mid x) \big) \\
&= \log p(x) - \mathbb{E}_{q(\pi,\phi,z)}\Big[ \log q(\pi, \phi, z) - \log p(x, \pi, \phi, z) + \log p(x) \Big] \\
&= \mathbb{E}_{q(\pi,\phi,z)}\Big[ \log p(\pi, \phi, z, x) - \log q(\pi, \phi, z) \Big]
\end{aligned}
\tag{2.92}
\]

Under our chosen exponential family forms for each factor of the approximate posterior, the expectations that define L lead to a closed-form function of the free parameters τ̂, ν̂, θ̂, r̂. We can tractably evaluate L and take derivatives, making optimization of the free parameters possible.

We emphasize that because the KL divergence term is always non-negative, the objective L can be interpreted as a strict lower bound on the marginal likelihood log p(x). We thus often refer to the function L as the Evidence Lower BOund, or ELBO. Maximizing L can be interpreted as improving the approximate posterior's explanation of the data.
We can write the objective L as a sum of two terms:

L(x, r̂, θ̂, τ̂, ν̂) = L_data(x, r̂, τ̂, ν̂) + L_alloc(r̂, θ̂)    (2.93)

These terms describe distinctly interpretable pieces of the overall model: L_data gathers terms related to the observation model and L_alloc gathers terms related to the allocation model. Our chosen notation for each term of the objective highlights the variational parameters involved. For example, L_alloc(r̂, θ̂) is independent of the observation parameters ν̂, τ̂. These terms may also be functions of the data x and hyperparameters H, but we omit these arguments in notation for simplicity.

This term-by-term breakdown of the objective encourages a modular implementation. Given fixed assignments r̂, the problem of finding the optimal observation model parameters ν̂, τ̂ must be independent of the allocation probability parameters θ̂, and vice versa. This modularization lets us implement free parameter updates once for each possible observation model or allocation model, and compose these modules to create an overall model.

2.5.3 Observation model term of the objective

The data term is defined by

L_data(x, r̂, τ̂, ν̂) ≜ E_q [ log p(x|z, φ) + log ( p(φ|ν̄, τ̄) / q(φ|ν̂, τ̂) ) ]    (2.94)
  = ∑_{n=1}^{N} ∑_{k=1}^{K} r̂_nk E_{q(φ_k|ν̂_k,τ̂_k)} [ log p(x_n|φ_k) ] + ∑_{k=1}^{K} E_{q(φ_k|ν̂_k,τ̂_k)} [ log ( p(φ_k|ν̄, τ̄) / q(φ_k|ν̂_k, τ̂_k) ) ]

where we have already substituted in the soft assignment free parameters: r̂_nk = E_{q(z_n|r̂_n)}[δ_k(z_n)]. We can further simplify this objective by using the derivations from Sec. 2.1.3. We have

L_data(x, r̂, τ̂, ν̂) = ∑_{n=1}^{N} h(x_n) + ∑_{k=1}^{K} [ c_P(τ̄, ν̄) − c_P(τ̂_k, ν̂_k) ]    (2.95)
  + ∑_{k=1}^{K} ( N_k(r̂) + ν̄ − ν̂_k ) E_{q(φ_k)} [ −c_L(φ_k) ]
  + ∑_{k=1}^{K} ∑_{d=1}^{D} ( S_kd(x, r̂) + τ̄_d − τ̂_kd ) E_{q(φ_k)} [ φ_kd ]

Here, we define S_k(x, r̂) as the expected value of the earlier sufficient statistic S_k(x, z) from Eq. (2.53) under q(z). Similarly, N_k(r̂) is the expected value of N_k(z) from Eq. (2.53).
S_k(x, r̂) ≜ ∑_{n=1}^{N} E_{q(z_n|r̂_n)} [ δ_k(z_n) s(x_n) ] = ∑_{n=1}^{N} r̂_nk s(x_n)    (2.96)

N_k(r̂) ≜ ∑_{n=1}^{N} E_{q(z_n|r̂_n)} [ δ_k(z_n) ] = ∑_{n=1}^{N} r̂_nk    (2.97)

The remaining expectations in Eq. (2.95) are the fundamental ones of the chosen conjugate exponential family likelihood-prior system, which have closed form.

2.5.4 Allocation model term of the objective

We write the allocation model's contribution to the objective as:

L_alloc(r̂, θ̂) ≜ E_{q(z|r̂) q(π|θ̂)} [ log ( p(z|π) / q(z|r̂) ) + log ( p(π|γ) / q(π|θ̂) ) ]    (2.98)

Regrouping terms, we can separate this into an entropy term for the distribution q(z|r̂) and a term which gathers all functions of θ̂:

L_alloc(r̂, θ̂) = L_entropy(r̂) + L_Dir-alloc(r̂, θ̂)    (2.99)

Entropy term

The entropy of the assignments is a simple non-linear function of the responsibilities:

L_entropy(r̂) = − ∑_{n=1}^{N} E_q [ log q(z_n) ] = − ∑_{n=1}^{N} ∑_{k=1}^{K} r̂_nk log r̂_nk.    (2.100)

Dirichlet allocation term

Standard conjugate exponential family mathematics yields a simple expression for L_Dir-alloc:

L_Dir-alloc(r̂, θ̂) = E_q [ log p(z|π) + log ( p(π) / q(π) ) ]    (2.101)
  = c_Dir( γ/K, …, γ/K ) − c_Dir(θ̂)    (2.102)
  + ∑_{k=1}^{K} ( N_k(r̂) + γ/K − θ̂_k ) E_q [ log π_k ]

Here, the expectations have closed form under the assumed distribution q(π|θ̂), and the cumulant function of the Dirichlet distribution is

c_Dir(a_1, a_2, …, a_K) ≜ log Γ( ∑_{ℓ=1}^{K} a_ℓ ) − ∑_{k=1}^{K} log Γ(a_k)    (2.103)

Finally, the count statistic N_k(r̂) for each cluster k is defined above in Eq. (2.97). We can interpret N_k as the effective count of data observations assigned to cluster label k.

2.6 Variational inference algorithm for the finite mixture model

We can now formally define the constrained optimization problem for our free parameters τ̂, ν̂, θ̂, r̂ given an observed dataset x and hyperparameters H:

arg max_{τ̂, ν̂, θ̂, r̂} L_data(x, r̂, τ̂, ν̂) + L_entropy(r̂) + L_Dir-alloc(r̂, θ̂)    (2.104)
  subject to r̂_n ≥ 0 and ∑_{k=1}^{K} r̂_nk = 1 for n = 1, 2, …, N
  θ̂_k ≥ 0 for k = 1, 2, …
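The summary statistics of Eqs. (2.96)-(2.97) are simple weighted sums, which the following minimal sketch computes in plain Python (the helper name and list-based representation are our own, not from the thesis software):

```python
# x_stats[n] is the sufficient-statistic vector s(x_n); resp[n][k] is r_nk.

def summary_stats(x_stats, resp, K):
    """Return (S, N): expected statistics S_k and expected counts N_k."""
    D = len(x_stats[0])
    S = [[0.0] * D for _ in range(K)]
    N = [0.0] * K
    for n, s_n in enumerate(x_stats):
        for k in range(K):
            N[k] += resp[n][k]                    # N_k = sum_n r_nk
            for d in range(D):
                S[k][d] += resp[n][k] * s_n[d]    # S_k = sum_n r_nk s(x_n)
    return S, N

# Toy check: two observations, two clusters, scalar statistic s(x) = x.
x_stats = [[2.0], [4.0]]
resp = [[1.0, 0.0], [0.5, 0.5]]
S, N = summary_stats(x_stats, resp, K=2)
assert N == [1.5, 0.5] and S == [[4.0], [2.0]]
# Each row of resp sums to one, so the counts sum to N observations.
assert abs(sum(N) - 2.0) < 1e-12
```

Because both statistics are additive over observations, they can be accumulated batch by batch, which is the property the memoized algorithms of Ch. 3 exploit.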
K
  ν̂_k ≥ 0 and τ̂_k ∈ M for k = 1, 2, …, K

Given the internal structure of the objective, we pursue a block-coordinate ascent algorithm which proceeds in three steps. Each step updates the free parameters of one factor among q(z), q(π), q(φ) while holding the other free parameters fixed. When applied iteratively, these steps are guaranteed to monotonically improve the whole objective L until it converges to a fixed point, a local optimum. Below, we define the optimization problem solved by each step of the block-coordinate ascent algorithm. In the following sections we then describe detailed closed-form solutions to each problem.

Local step: Update local assignment responsibilities

arg max_{r̂} L_data(x, r̂, τ̂, ν̂) + L_entropy(r̂) + L_Dir-alloc(r̂, θ̂)    (2.105)
  subject to r̂_n ≥ 0 and ∑_{k=1}^{K} r̂_nk = 1 for n = 1, 2, …, N

Global step: Update observation model global parameters

arg max_{τ̂, ν̂} L_data(x, r̂, τ̂, ν̂)  subject to ν̂_k ≥ 0 and τ̂_k ∈ M for k = 1, 2, …, K    (2.106)

Global step: Update allocation model global parameters

arg max_{θ̂} L_Dir-alloc(r̂, θ̂)  subject to θ̂_k ≥ 0 for k = 1, 2, …, K    (2.107)

2.6.1 Local parameter update step

To solve the local step optimization problem in Eq. (2.105), we first simplify the whole objective L by viewing it as a function of the responsibilities at each observation n. We gather those terms that do not depend on the responsibilities r̂ for clusters k ∈ {1, …, K} together into a constant term:

L(x, r̂, θ̂, τ̂, ν̂) = const(x, τ̂, ν̂) + ∑_{n=1}^{N} L_n(r̂_n, x_n, θ̂, τ̂, ν̂)    (2.108)

Now, the objective at observation n is given by

L_n(r̂_n, x_n, θ̂, τ̂, ν̂) ≜ ∑_{k=1}^{K} r̂_nk ( W_nk(x_n, θ̂, τ̂, ν̂) − log r̂_nk )    (2.109)

W_nk(x_n, θ̂, τ̂, ν̂) ≜ E_{q(φ_k)} [ log p(x_n|φ_k) ] + E_{q(π|θ̂)} [ log π_k ]    (2.110)

Here, we can interpret each W_nk as the log posterior weight that cluster k has on observation n. Larger values indicate that cluster k is more likely to explain observation n.
The expectations that define W_nk have closed form under our chosen approximate posterior family q(π)q(φ). From the decomposition of L into a sum of independent terms for each observation, it is clear that our objective may be optimized independently for each observation n:

r̂_n* = arg max_{r̂_n} L_n(r̂_n, x_n, θ̂, τ̂, ν̂)    (2.111)
  subject to r̂_n ≥ 0 and ∑_{k=1}^{K} r̂_nk = 1

Through standard constrained optimization methods, we find the solution for the optimal responsibility vector r̂_n*. We can compute the optimal vector in closed form by setting each entry k ∈ {1, 2, …, K} to the exponentiated posterior weight e^{W_nk} and then normalizing:

r̂_nk* = e^{W_nk} / ∑_{ℓ=1}^{K} e^{W_nℓ},    (2.112)

Algorithm 2.3 Update for responsibilities given log posterior weights for mixture model.
Input: [W_n1 W_n2 … W_nK] : log posterior weights.
Output: [r̂_n1 … r̂_nK] : responsibility values for each cluster
1: function RespFromWeights(W_n)
2:   for k ∈ 1, …, K do
3:     r̂_nk = e^{W_nk}
4:   s_n = ∑_{k=1}^{K} r̂_nk
5:   for k ∈ 1, …, K do
6:     r̂_nk = r̂_nk / s_n
7:   return r̂_n
RespFromWeights delivers posterior probabilities of assigning the data observation at index n to each cluster k in the set {1, 2, …, K}. See Sec. 2.6.1 for details.

which by inspection is guaranteed to obey the required constraints that responsibilities must be non-negative and sum to one. By performing the update in Eq. (2.112) at each observation n independently, we compute the optimal responsibilities r̂ for the whole dataset.

2.6.2 Global parameter update step

Global step for observation model

As discussed in Sec. 2.1.3, the optimal observation global parameters τ̂_k*, ν̂_k* for every cluster k ∈ {1, 2, …, K} which solve the optimization problem in Eq. (2.106) can be found in closed form:

τ̂_k* = S_k(x, r̂) + τ̄    (2.113)
ν̂_k* = N_k(r̂) + ν̄

These optimal values naturally satisfy the required constraints ν̂_k* ∈ R_+ and τ̂_k* ∈ M, so that ν̂_k* and τ̂_k* remain valid parameters for the density q(φ_k).
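In practice the update of Eq. (2.112) is applied to log weights that can be large in magnitude, so direct exponentiation may overflow. The sketch below (our own, not the thesis implementation) subtracts the maximum weight first; this shift cancels in the ratio and leaves Eq. (2.112) unchanged:

```python
import math

def resp_from_weights(W_n):
    """Map log posterior weights [W_n1 ... W_nK] to responsibilities,
    as in Alg. 2.3, stabilized with the log-sum-exp shift."""
    m = max(W_n)
    unnorm = [math.exp(w - m) for w in W_n]   # e^(W_nk - max) never overflows
    s = sum(unnorm)
    return [u / s for u in unnorm]

r = resp_from_weights([1000.0, 1001.0])   # naive exp() would overflow here
assert abs(sum(r) - 1.0) < 1e-12 and all(v >= 0.0 for v in r)
assert r[1] > r[0]   # larger log weight yields larger responsibility
```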
Simplified L_data objective term. Substituting the optimal points τ̂*, ν̂* into the original objective L_data in Eq. (2.95), we see that the last two lines involving the terms N_k + ν̄ − ν̂_k and S_k + τ̄ − τ̂_k evaluate to zero, leaving a greatly simplified expression:

L_data(x, r̂, τ̂*, ν̂*) = ∑_{n=1}^{N} h(x_n) + ∑_{k=1}^{K} [ c_P(τ̄, ν̄) − c_P(τ̂_k*, ν̂_k*) ]    (2.114)

The reference measure term remains unchanged, while the sum over the cumulant functions is over the K active clusters.

Global step for allocation model

To find the optimal values θ̂ for the optimization problem in Eq. (2.107), we can apply standard constrained optimization techniques like Lagrange multipliers to find the closed-form solution for

Algorithm 2.4 Variational coordinate ascent for finite mixture model
Input:
  {x_n}_{n=1}^{N} : dataset with N exchangeable observations
  τ̄, ν̄ : hyperparameters defining the prior P(φ_k | τ̄, ν̄)
  θ̂ : initial global parameters of the allocation model
  {τ̂_k, ν̂_k}_{k=1}^{K} : initial global parameters of the observation model
Output:
  θ̂ : final global parameters of the allocation model
  {τ̂_k, ν̂_k}_{k=1}^{K} : final global parameters of the observation model
1: function VariationalCoordAscentForFiniteMixtureModel
2:   while not converged do
3:     for n ∈ 1, 2, …, N do  ⊲ Local parameter update step
4:       for k ∈ 1, 2, …, K do
5:         C_nk ← E_q[log p(x_n|φ_k)]
6:         W_nk ← C_nk + E_q[log π_k]
7:       r̂_n = RespFromWeights(W_n)
8:     for k ∈ 1, 2, …, K do  ⊲ Summary step
9:       S_k(x, r̂) = ∑_{n=1}^{N} r̂_nk s(x_n)
10:      N_k(r̂) = ∑_{n=1}^{N} r̂_nk
11:    for k ∈ 1, 2, …, K do  ⊲ Global parameter update step
12:      τ̂_k ← S_k(x, r̂) + τ̄
13:      ν̂_k ← N_k(r̂) + ν̄
14:      θ̂_k ← N_k(r̂) + γ/K

Block coordinate ascent algorithm for approximate posterior inference for the finite mixture model. We alternate between two updates: one for the local assignment parameters r̂ given fixed global parameters, and another for the global allocation parameters θ̂ and the global observation parameters τ̂, ν̂.
All update steps scale linearly with the number of observations N and clusters K.

each cluster k:

θ̂_k* = N_k(r̂) + γ/K    (2.115)

Naturally, these updates preserve the required constraint θ̂ ≥ 0.

2.6.3 Full-dataset optimization algorithm

Alg. 2.4 gives the overall algorithm for finding the optimal free parameters under our optimization objective function L for a given dataset x. Given some initial global parameters, it iteratively cycles between the local step for q(z_n) in Eq. (2.111) and the global parameter updates. The global step decomposes into an update for the observation model in Eq. (2.113) and an update for the allocation model in Eq. (2.115), which can be done independently once the required sufficient statistics are computed.

2.7 Discussion

We emphasize that Alg. 2.1 and Alg. 2.4 are two possible approaches to training a finite mixture model with K possible clusters from observed data x. The first uses a MAP point estimation objective:

J(z, π, µ) = log p(ϕ(π)) + ∑_{k=1}^{K} log p(φ_k(µ)) + ∑_{k=1}^{K} ∑_{n=1}^{N} δ_k(z_n) [ log π_k + log p(x_n|φ_k(µ)) ]    (2.116)

while the variational approach optimizes a lower bound on the marginal likelihood:

L(r̂, θ̂, τ̂, ν̂) ≤ log p(x|ν̄, τ̄, γ)    (2.117)

By inspection, we see that the steps of the full variational approach in Alg. 2.4 have almost the same runtime complexity as the point estimation algorithm. Certainly, the global step in both algorithms has exactly the same complexity given the sufficient statistics. Likewise, the computation of the weight matrix W in each algorithm has the same cost. The key difference lies in representing the point-estimated integer z_n versus the approximate posterior responsibility vector r̂_n. The variational approach has the additional step of performing the RespFromWeights update, which requires appreciably more work than simply taking the cluster index with maximum weight. However, the modest additional cost of the full variational approach yields significant gains in model selection.
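The global parameter update step of Alg. 2.4 (Eqs. 2.113 and 2.115) is a one-line closed form per cluster once the summaries are in hand. A minimal sketch in plain Python (variable names and toy inputs are ours):

```python
def global_step(S, N, tau_bar, nu_bar, gamma, K):
    """Closed-form optimal tau-hat, nu-hat, theta-hat for all K clusters.

    S[k] is the summary vector S_k(x, r); N[k] is the count N_k(r)."""
    tau_hat = [[S[k][d] + tau_bar[d] for d in range(len(tau_bar))]
               for k in range(K)]                        # Eq. (2.113)
    nu_hat = [N[k] + nu_bar for k in range(K)]           # Eq. (2.113)
    theta_hat = [N[k] + gamma / K for k in range(K)]     # Eq. (2.115)
    return tau_hat, nu_hat, theta_hat

tau_hat, nu_hat, theta_hat = global_step(
    S=[[4.0], [2.0]], N=[1.5, 0.5], tau_bar=[1.0], nu_bar=2.0,
    gamma=1.0, K=2)
assert tau_hat == [[5.0], [3.0]]
assert nu_hat == [3.5, 2.5]
assert theta_hat == [2.0, 1.0]
```

Note the modularity discussed in Sec. 2.5.2: the observation updates touch only (S, N, τ̄, ν̄) and the allocation update touches only (N, γ), so the two could be implemented and tested as separate modules.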
Chapter 3

Scalable variational inference for Dirichlet process mixture models

We now consider the simplest Bayesian nonparametric model for clustering: the Dirichlet process mixture model. Recall the finite mixture model with a fixed set of K possible clusters from Ch. 2. One formal interpretation of the DP mixture model is the limit of the finite mixture model as the number of clusters K → ∞. By providing an infinite number of possible clusters a priori, the DP mixture model avoids an overly restrictive fixed number of clusters and easily transitions from a few hundred examples to a few million examples, because the model capacity adjusts automatically.

Both MCMC samplers (Neal, 1992; Rasmussen, 1999; Walker, 2007) and early variational optimization algorithms (Blei and Jordan, 2006; Kurihara et al., 2006) are well-known approaches for training DP mixture models from datasets. This chapter describes and extends our earlier work described in a NIPS 2013 conference paper co-authored with Erik Sudderth (Hughes and Sudderth, 2013), whose fundamental contributions are:

Contribution 1: Improved approximations to the posterior via nested truncation. While there are an infinite number of clusters available to this model under both its prior and posterior, only a finite number of clusters can be practically represented in a computer. Previous variational approaches (Blei and Jordan, 2006; Kurihara et al., 2006) had chosen approximate posterior distributions q(z)q(φ)q(π^G) with stringent truncation assumptions that lead to modeling artifacts. Instead, we take inspiration from Teh et al. (2008) to consider a truncation which is more elegant yet equally fast to train.

Contribution 2: New memoized inference algorithm for scalable training from millions of examples.
Inspired by the incremental expectation-maximization (EM) algorithm of Neal and Hinton (1998), we develop our own incremental algorithm for training approximate posterior representations rather than point estimates. Using memoized or cached sufficient statistics, we can process small subsets of data before each global parameter update step yet maintain exact whole-dataset summaries. Unlike stochastic algorithms, our memoized algorithm avoids nuisance learning rates entirely and formally guarantees the ELBO objective function will monotonically increase after every step.

Roadmap. We first formally define the stick-breaking transformation, which we then use to define a generative process for the DP mixture model with independently-sampled global random variables. With the generative model in place, we develop a variational optimization problem with standard mean-field assumptions and our chosen nested truncation for the approximate posterior of each assigned cluster label q(z_n). Closely following the variational algorithm for finite mixture models from Ch. 2, we develop the objective function and the relevant update steps for global and local variables. We then develop two scalable alternatives: stochastic variational inference and memoized variational inference. In this chapter, we focus on fixed-truncation algorithms and leave the description of adaptive proposals for adding and removing clusters to Ch. 4.

3.1 Stick-breaking construction of the Dirichlet process

The Dirichlet process defines a distribution over random probability measures, or equivalently functions over a predefined event space that are non-negative and sum to one. Here we provide a formal definition of this stochastic process as well as a generative construction.

3.1.1 Dirichlet processes

Suppose that Φ is a measurable event space, either real or discrete. For example, Φ could be the set of all positive integers or the set of all real numbers. We first define a properly-normalized base measure P over this space.
For example, if Φ is the real line, then the density P could be a univariate Gaussian distribution. Formally, we require that if Φ is a discrete space, we have ∑_{φ∈Φ} P(φ) = 1, while if continuous we have ∫_{φ∈Φ} P(φ) dφ = 1.

Now, consider another measure R over the event space Φ. We say that R is a realization of a Dirichlet process with base measure P if, for any complete partition of Φ into K disjoint pieces labelled T_1, T_2, …, T_K, we have:

[R(T_1) R(T_2) … R(T_K)] ∼ Dir_K( γP(T_1), γP(T_2), …, γP(T_K) )    (3.1)

Here, the scalar γ > 0 is the concentration parameter of the Dirichlet process. We write this realization as R ∼ DP(γ, P). For further formal details of this definition, see (Sudderth, 2006, Theorem 2.5.1).

For example, consider the event space of all positive integers Φ = {1, 2, 3, 4, …}. Let the base measure P be an infinite vector of probabilities that sum to one: {P_k}_{k=1}^{∞}. The realization R would also correspond to an infinite vector of probabilities {R_k}_{k=1}^{∞}. Suppose we partitioned the event space Φ into the first two clusters T_1 = {1}, T_2 = {2}, and all remaining clusters T_{>2} = {3, 4, …}. Then, we would consider R a realization of a Dirichlet process if:

[R_1, R_2, R_{>2}] ∼ Dir_3( γP_1, γP_2, γP_{>2} )    (3.2)

where the subscript >2 is defined as the infinite remaining sum of all terms beyond index 2: R_{>2} = ∑_{k=3}^{∞} R_k. This definition is not constructive, because it does not give us a formula for generating sample realizations R. We develop such a construction in the next section.

3.1.2 Stick-breaking transformation

Recall that under the finite mixture model from Ch. 2, we draw the K-dimensional frequency vector from a Dirichlet prior: π^G ∼ Dir_K(γ/K, …, γ/K). Taking the infinite limit of this density as K → +∞ is not directly tractable. We cannot generate samples from the infinite limit the same way we would usually draw finite samples.
Instead, we develop an alternative construction which lets us build an infinite-dimensional frequency vector π^G from a series of independent, identically-distributed random variables {u_k}_{k=1}^{∞}. Formally, there is a one-to-one invertible transformation between π^G and u called the stick-breaking transformation, which was developed by Sethuraman (1994).

Stick-breaking transformation

Consider two infinite-dimensional vector spaces:

Δ_∞ ≜ { π ∈ R^∞ : ∑_{k=1}^{∞} π_k = 1, π_k ≥ 0 ∀k }    (3.3)
U_∞ ≜ { u ∈ R^∞ : u_k ∈ [0, 1] ∀k }

First, the simplex Δ_∞ contains normalized vectors π of non-negative real values that sum to one. Second, the space U_∞ contains unnormalized vectors u of real values between zero and one. There exists a deterministic, invertible, one-to-one transformation between each normalized vector π ∈ Δ_∞ and a corresponding vector u ∈ U_∞, defined as:

π_k(u) = u_k ∏_{ℓ=1}^{k−1} (1 − u_ℓ),    u_k(π) = π_k / ( 1 − ∑_{ℓ=1}^{k−1} π_ℓ )    (3.4)

The transformation in Eq. (3.4) is often called the stick-breaking transformation. Because it is invertible, we can apply it to any probability density defined over the space U_∞ and obtain a corresponding density over Δ_∞.

Sampling a realization via stick-breaking. Consider an infinite series of values u = {u_k}_{k=1}^{∞} consisting of independent draws from a Beta distribution: u_k ∼ Beta(1, γ). Consider a corresponding infinite series of values φ = {φ_k}_{k=1}^{∞} drawn as independent samples from the base measure: φ_k ∼ P. We can construct a realization of the Dirichlet process as:

R = ∑_{k=1}^{∞} π_k^G(u) δ(φ_k),    π_k^G(u) = u_k ∏_{ℓ=1}^{k−1} (1 − u_ℓ)    (3.5)

Here, the realization R is composed of infinitely-many point masses φ_k, each with a frequency value π_k^G. The frequency at cluster k, denoted π_k^G, is the stick-breaking transform of the independent conditional probabilities u into the simplex. This transformed vector π^G satisfies the definition of the Dirichlet process from Eq. (3.1) (Sudderth, 2006).

Stick-breaking construction is size-biased.
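The forward and inverse maps of Eq. (3.4) can each be computed in a single pass over a finite prefix of the infinite vectors. A minimal sketch with our own helper names (not from the thesis code):

```python
def pi_from_u(u):
    """Forward map of Eq. (3.4): pi_k = u_k * prod_{l<k} (1 - u_l)."""
    pi, stick = [], 1.0
    for u_k in u:
        pi.append(u_k * stick)
        stick *= 1.0 - u_k      # length of stick remaining after break k
    return pi

def u_from_pi(pi):
    """Inverse map of Eq. (3.4): u_k = pi_k / (1 - sum_{l<k} pi_l)."""
    u, remaining = [], 1.0
    for pi_k in pi:
        u.append(pi_k / remaining)
        remaining -= pi_k
    return u

u = [0.5, 0.25, 0.8]
pi = pi_from_u(u)
# Round trip recovers u, confirming the transformation is invertible.
assert all(abs(a - b) < 1e-12 for a, b in zip(u_from_pi(pi), u))
assert sum(pi) <= 1.0 and all(p >= 0.0 for p in pi)
```

Note that a finite prefix of π never sums past one; the leftover mass `stick` belongs to the infinitely many clusters not yet broken off.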
Under the stick-breaking construction, the expected value of each entry of the frequency vector π^G decreases as the index k increases:

E_{u∼Beta(1,γ)} [π_1(u)] = 1/(1+γ)    (3.6)
E_{u∼Beta(1,γ)} [π_2(u)] = (1/(1+γ)) · (γ/(1+γ))
E_{u∼Beta(1,γ)} [π_3(u)] = (1/(1+γ)) · (γ/(1+γ)) · (γ/(1+γ))
…
E_{u∼Beta(1,γ)} [π_k(u)] = γ^{k−1} / (1+γ)^{k}

Thus, in general, realizations of the vector π^G(u) will tend to have larger values for early indices than later ones. We call this property size-biasedness. This property does not exist in the finite mixture model, where we draw π^G from a symmetric Dirichlet distribution: π^G ∼ Dir_K(γ/K, …, γ/K). To see the connection, we imagine drawing an infinite vector p ∈ Δ_∞ from the following limiting density:

p ∼ lim_{K→∞} Dir_K( [γ/K … γ/K] )    (3.7)

After sorting the resulting vector p in descending order, we find that the distribution of the sorted vector p is equivalent to the distribution of π^G.

Alternative stochastic process priors. The Dirichlet process (DP) serves as the underlying stochastic process for all the probabilistic clustering models we consider. Extensions to the richer class of Pitman-Yor (PY) processes (Pitman and Picard, 2006) or the logistic stick-breaking process should be possible, and at least for the PY mixture model should be straightforward, but we do not explore these here.

3.2 Dirichlet process mixture models

The Dirichlet process mixture model, often called a DP mixture model, was first introduced by Ferguson (1973). This widely-used Bayesian nonparametric model generates an exchangeable dataset x = {x_1, …, x_N} with N total observations from a countably infinite set of clusters with labels in the set of integers {1, 2, 3, …, k, k+1, …}. Each cluster with label k has two parameters: the appearance probability π_k^G and the cluster shape parameter φ_k. These are the global parameters. Each observation n has one local unknown variable: its discrete assignment z_n ∈ {1, 2, …}. A directed graphical model is shown in Fig. 3.1.
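The size-biased expectations of Eq. (3.6) are easy to check by Monte Carlo. The following is a quick sanity-check sketch of our own (γ = 2 and the sample budget are arbitrary choices), averaging stick-breaking weights over many independent draws:

```python
import random

# Check Eq. (3.6): with u_k ~ Beta(1, gamma) i.i.d.,
#   E[pi_k(u)] = gamma^(k-1) / (1 + gamma)^k.
random.seed(0)
gamma, n_samples, K = 2.0, 200_000, 3
totals = [0.0] * K
for _ in range(n_samples):
    stick = 1.0
    for k in range(K):
        u_k = random.betavariate(1.0, gamma)
        totals[k] += u_k * stick     # pi_k(u) = u_k * prod_{l<k} (1 - u_l)
        stick *= 1.0 - u_k

for k in range(K):
    expected = gamma ** k / (1.0 + gamma) ** (k + 1)
    assert abs(totals[k] / n_samples - expected) < 0.005
```

The ratio of successive expectations is γ/(1+γ) < 1 for every γ > 0, so the expected weights decay geometrically regardless of the concentration parameter.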
Figure 3.1: Directed graphical representation of the Dirichlet process mixture model. We show the fundamental random variables (circled nodes), hyperparameters (gray), and variational free parameters (red) of the Dirichlet process (DP) mixture model, a Bayesian nonparametric model with K → ∞ clusters. Each cluster is defined by two global parameters: conditional probability u_k and shape parameter φ_k. The conditional probabilities u are deterministically mapped to global cluster probabilities π^G via the invertible stick-breaking transformation. Each of the N observed data atoms x_n has an associated local cluster assignment z_n. The data is assumed to be exchangeable and conditionally independent given the global parameters. Under our chosen nested truncation for the approximate posterior, each q(z_n) is defined by a responsibility vector r̂_n which places mass only on the first K cluster labels.

3.2.1 Generative model for global parameters

We use the earlier stick-breaking construction to generate each cluster k's two global parameters: the conditional probability u_k and the shape parameter φ_k.

Allocation model prior on global cluster frequencies. We generate the conditional probability u_k ∈ (0, 1) independently from a Beta distribution: u_k ∼ Beta(1, γ). We interpret this as the probability of choosing cluster k among the infinite set of cluster labels with index k or larger: {k, k+1, …}. From the series {u_ℓ}_{ℓ=1}^{k}, we can obtain the corresponding frequency value π_k^G = π_k(u) via the stick-breaking transformation. Because this transformation is invertible, any distribution on the quantity u implies a corresponding distribution on π^G. We can choose to work with either π^G or u when convenient. We find that it is usually easier to work with u.

Observation model prior on global cluster shape parameters.
Specifying a DP mixture model requires a choice of the exponential family conjugate prior density P for observation parameters {φ_k}_{k=1}^{∞}. As discussed in Sec. 2.1.3, each cluster's parameters are drawn i.i.d. from this density: φ_k ∼ P(φ_k | τ̄, ν̄). Recall that hyperparameter ν̄ determines the effective sample size of this prior, while τ̄ then determines the expected mean parameter: E_{φ_k∼P}[µ(φ_k)] = τ̄/ν̄.

3.2.2 Generative model for local variables

Given fixed global parameters, we sample a cluster label from the infinite set {1, 2, …} according to the probability vector π(u), a deterministic function of u. Then, given the chosen cluster label z_n, we draw an observation x_n from the observation model's likelihood density with parameter φ_{z_n}:

z_n ∼ Cat_∞( π_1(u), …, π_k(u), … )    (3.8)
x_n ∼ L(φ_{z_n})    (3.9)

We previously introduced the exponential family likelihood density L in Sec. 2.1.3. The complete joint probability is then

log p(x, z, u, φ) = ∑_{n=1}^{N} ∑_{k=1}^{∞} δ_k(z_n) [ log L(x_n|φ_k) + log π_k(u) ] + ∑_{k=1}^{∞} [ log P(φ_k) + log Beta(u_k|1, γ) ]    (3.10)

where we substitute in the definition π_k(u) = u_k ∏_{ℓ<k} (1 − u_ℓ). Under the mean-field approximate posterior, each conditional probability u_k is given an independent Beta factor q(u_k) = Beta(u_k | η̂_k1, η̂_k0), where η̂_k1 > 0, η̂_k0 > 0 are global free parameters for the allocation model. If η̂_k1 ≫ η̂_k0, then E_q[u_k] ≈ 1; otherwise if η̂_k0 ≫ η̂_k1 then E_q[u_k] ≈ 0. This is why we index these free parameters with 1 and 0: they may be interpreted as pseudo-counts for the events u_k = 1 and u_k = 0.

Alternative parameterization. We can also write the approximate posterior in terms of parameters {û_k, ω̂_k}_{k=1}^{∞}:

q(u) = ∏_{k=1}^{∞} Beta( u_k | û_k ω̂_k, (1 − û_k) ω̂_k )    (3.16)

Under this parameterization, the free parameter û_k ∈ [0, 1] exactly sets the expected value of u_k: E_q[u_k] = û_k, while ω̂_k > 0 controls the variance.
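The generative process of Eqs. (3.8)-(3.9) can be sketched by ancestral sampling with a finite truncation of the sticks. The snippet below is our own illustration, not the thesis code: the unit-variance Gaussian observation model, the truncation level K, and all constants are arbitrary choices:

```python
import random

random.seed(1)
gamma, K, N = 1.0, 50, 100

# Global allocation parameters: u_k ~ Beta(1, gamma), truncated at K
# by folding the leftover stick mass into the last entry (a practical
# simplification, not part of the infinite generative model).
u = [random.betavariate(1.0, gamma) for _ in range(K)]
u[-1] = 1.0
pi, stick = [], 1.0
for u_k in u:                     # stick-breaking map, Eq. (3.5)
    pi.append(u_k * stick)
    stick *= 1.0 - u_k

# Global observation parameters: cluster means phi_k ~ P (here Gaussian).
phi = [random.gauss(0.0, 5.0) for _ in range(K)]

def sample_cat(probs):
    """Draw one index from a categorical distribution."""
    r, acc = random.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if r < acc:
            return k
    return len(probs) - 1

z = [sample_cat(pi) for _ in range(N)]             # Eq. (3.8)
x = [random.gauss(phi[z_n], 1.0) for z_n in z]     # Eq. (3.9)
assert abs(sum(pi) - 1.0) < 1e-9 and len(set(z)) <= K
```

Even with K = 50 sticks available, the size-biased weights mean a sample of N = 100 observations typically occupies far fewer distinct clusters, which is the behavior the nested truncation of Sec. 3.3 exploits.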
There is a one-to-one invertible transformation between (û, ω̂) and η̂:

η̂_k1 = û_k ω̂_k,    û_k(η̂) = η̂_k1 / (η̂_k1 + η̂_k0)    (3.17)
η̂_k0 = (1 − û_k) ω̂_k,    ω̂_k(η̂) = η̂_k1 + η̂_k0

This alternative parameterization is useful because the expected value of π^G(u) can be written entirely in terms of û:

E_q[π_k^G(u)] = û_k ∏_{ℓ=1}^{k−1} (1 − û_ℓ)    (3.18)

Approximate posterior q(φ_k) for global cluster shape

Each cluster k in our countably infinite set is given an independent posterior factor q(φ_k) for its shape parameter φ_k. Following the analysis in Sec. 2.1.3, we assume this factor comes from the conjugate prior family P. The factor has two free parameters: pseudo-count ν̂_k > 0 and vector parameter τ̂_k which defines the cluster shape.

Local assignment factor q(z_n)

Under the generative model, the assigned cluster label z_n for observation n could use any of the infinitely many cluster indices. Thus, we have naively written above that q(z_n) defines a discrete distribution over infinitely many clusters. We can write the free parameters for this factor as a vector of countably infinite length with one entry per cluster: r̂_n = [r̂_n1, r̂_n2, …, r̂_nk, …]. To be a valid parameter for the categorical distribution, this infinite vector must obey two constraints: (1) every entry is non-negative: r̂_nk ≥ 0, and (2) the whole vector sums to unity: ∑_{k=1}^{∞} r̂_nk = 1. Intuitively, each scalar entry r̂_nk ∈ [0, 1] represents the probability that observation x_n is assigned to cluster k under the approximate posterior. This probability value is sometimes called the responsibility. Formally, r̂_nk ≜ E_{q(z)}[δ_k(z_n)].

The problem with defining q(z_n) as a distribution over an unbounded number of clusters is that we cannot represent an infinite vector r̂_n in a practical implementation. However, despite an infinite number of clusters available a priori from a BNP model, given a finite dataset with N atoms only a finite set of K unique cluster labels will be assigned, where 1 ≤ K ≤ N.
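The reparameterization of Eq. (3.17) is a two-line computation in each direction. A minimal sketch with our own function names:

```python
def to_mean_form(eta1, eta0):
    """(eta_k1, eta_k0) -> (u_hat_k, omega_k), as in Eq. (3.17)."""
    return eta1 / (eta1 + eta0), eta1 + eta0

def to_pseudocount_form(u_hat, omega):
    """(u_hat_k, omega_k) -> (eta_k1, eta_k0), the inverse direction."""
    return u_hat * omega, (1.0 - u_hat) * omega

eta1, eta0 = 9.0, 1.0
u_hat, omega = to_mean_form(eta1, eta0)
assert u_hat == 0.9 and omega == 10.0   # E_q[u_k] = 0.9, total pseudo-count 10
back = to_pseudocount_form(u_hat, omega)
assert abs(back[0] - eta1) < 1e-12 and abs(back[1] - eta0) < 1e-12
```

The mean form is convenient precisely because û feeds directly into the expected frequencies of Eq. (3.18), while the pseudo-count form matches the natural parameters updated by coordinate ascent.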
The remaining unassigned clusters will be conditionally independent of the data. Inspired by this fact, we introduce an additional constraint on the vector r̂_n: only the first K entries may have non-zero values; all remaining entries with cluster index k > K equal exactly zero.

q(z_n | r̂_n, K) = Cat_∞( z_n | r̂_n1, r̂_n2, …, r̂_nK, … )    (3.19)
  s.t. r̂_n ≥ 0,  ∑_{k=1}^{∞} r̂_nk = 1,  ∑_{k=K+1}^{∞} r̂_nk = 0    (3.20)

Here, the integer K > 0 is a fixed truncation level, which can be held constant throughout classic inference or optimized by our proposal moves in Ch. 4. This local truncation assumption constrains only the form of the local assignment factor q(z_n). No additional truncation assumptions are needed for either q(φ) or q(u). Instead, a direct consequence of assuming q(z_n) has positive mass only on the first K clusters is that all clusters with index k > K are conditionally independent of the observed data. We thus cannot learn anything about the free allocation parameters {η̂_k}_{k=K+1}^{∞} from our data, nor the observation parameters {τ̂_k, ν̂_k}_{k=K+1}^{∞}. Instead, these factors have closed-form optimal values which match the corresponding prior marginals p(u_k) and p(φ_k) under the generative model.

3.3.2 Evidence lower-bound objective function

Given the assumed factorization of q(u, φ, z) above, we now set up an optimization problem over our free parameters ν̂, τ̂, η̂, r̂. The goal is to minimize the KL divergence between the approximate posterior q and the true posterior (Wainwright and Jordan, 2008):

arg min_{ν̂, τ̂, η̂, r̂} KL( q(u, φ, z | ν̂, τ̂, η̂, r̂) ‖ p(u, φ, z | x) )    (3.21)

Computing this KL divergence directly is not possible. However, following the arguments from our variational optimization problem for finite mixture models from Sec.
2.5.2, we define an equivalent optimization problem:

L(x, ν̂, τ̂, η̂, r̂, K) ≜ log p(x) − KL( q(u, φ, z | ν̂, τ̂, η̂, r̂, K) ‖ p(u, φ, z | x) )    (3.22)
  = E_{q(u,φ,z)} [ log p(x, u, φ, z) − log q(u, φ, z) ]    (3.23)

Under our chosen exponential family forms for each factor of q(u, φ, z), the expectations that define L lead to a closed-form function of the free parameters τ̂, ν̂, η̂, r̂. We can tractably evaluate L and take derivatives, making optimization of the free parameters possible. We emphasize that because the KL divergence term is always non-negative, the objective L can be interpreted as a strict lower bound on the marginal likelihood log p(x). We thus often refer to the function L as the Evidence Lower BOund, or ELBO. Maximizing L can be interpreted as improving the approximate posterior's explanation of the data.

We can write the objective L as a sum of two terms:

L(x, r̂, η̂, τ̂, ν̂) = L_data(x, r̂, τ̂, ν̂) + L_alloc(r̂, η̂)    (3.24)

These terms describe distinctly interpretable pieces of the overall model: L_data gathers terms related to the observation model and L_alloc gathers terms related to the allocation model. Our chosen notation for each term of the objective highlights the variational parameters involved. For example, L_alloc(r̂, η̂) is independent of the observation parameters ν̂, τ̂. These terms may also be functions of the data x and hyperparameters H, but we omit these arguments in notation for simplicity.

This term-by-term breakdown of the objective encourages a modular implementation. Given fixed assignments r̂, the problem of finding the optimal observation model parameters ν̂, τ̂ must be independent of the allocation probability parameters η̂, and vice versa. This modularization lets us implement free parameter updates once for each possible observation model or allocation model, and compose these modules to create an overall model.
3.3.3 Observation model term of the objective

The data term is defined by

L_data(x, r̂, τ̂, ν̂) ≜ E_q [ log p(x|z, φ) + log ( p(φ|ν̄, τ̄) / q(φ|ν̂, τ̂) ) ]    (3.25)
  = ∑_{n=1}^{N} ∑_{k=1}^{∞} r̂_nk E_{q(φ_k|ν̂_k,τ̂_k)} [ log p(x_n|φ_k) ] + ∑_{k=1}^{∞} E_{q(φ_k|ν̂_k,τ̂_k)} [ log ( p(φ_k|ν̄, τ̄) / q(φ_k|ν̂_k, τ̂_k) ) ]

where we have already substituted in the soft assignment free parameters: r̂_nk = E_{q(z_n|r̂_n)}[δ_k(z_n)]. We can further simplify this objective by using the derivations from Sec. 2.1.3. We have

L_data(x, r̂, τ̂, ν̂) = ∑_{n=1}^{N} h(x_n) + ∑_{k=1}^{∞} [ c_P(τ̄, ν̄) − c_P(τ̂_k, ν̂_k) ]    (3.26)
  + ∑_{k=1}^{∞} ( N_k(r̂) + ν̄ − ν̂_k ) E_{q(φ_k)} [ −c_L(φ_k) ]
  + ∑_{k=1}^{∞} ∑_{d=1}^{D} ( S_kd(x, r̂) + τ̄_d − τ̂_kd ) E_{q(φ_k)} [ φ_kd ]

where we use the previously defined sufficient statistics for the expected count N_k = ∑_{n=1}^{N} r̂_nk and the expected data statistic S_k = ∑_{n=1}^{N} r̂_nk s(x_n) for each cluster k. The remaining expectations are the fundamental ones of the chosen conjugate exponential family likelihood-prior system, which have closed form.

3.3.4 Allocation model term of the objective

We write the allocation model's contribution to the objective as:

L_alloc(r̂, η̂) ≜ E_{q(z|r̂) q(u|η̂)} [ log ( p(z|π) / q(z|r̂) ) + ∑_{k=1}^{K} log ( p(u_k|γ) / q(u_k|η̂_k) ) ]    (3.27)

Regrouping terms, we can separate this into an entropy term for the distribution q(z|r̂) and a term which gathers all functions of η̂:

L_alloc(r̂, η̂) = L_entropy(r̂) + L_DP-alloc(r̂, η̂)    (3.28)

Entropy term

The entropy of the assignments is a simple non-linear function of the responsibilities:

L_entropy(r̂) = − ∑_{n=1}^{N} E_q [ log q(z_n) ] = − ∑_{n=1}^{N} ∑_{k=1}^{K} r̂_nk log r̂_nk.    (3.29)

Because this is the entropy of a discrete random variable, we know by definition that this term is always non-negative: L_entropy(r̂) ≥ 0.
Stick-breaking allocation term. Standard conjugate exponential family mathematics yields a simple expression for L_DP-alloc:
\begin{align}
\mathcal{L}_{\mathrm{DP\text{-}alloc}}(\hat{r}, \hat{\eta})
&= \mathbb{E}_q\Big[ \log p(z \mid \pi(u)) + \log \frac{p(u)}{q(u)} \Big] \tag{3.30} \\
&= \sum_{k=1}^{\infty} \Big( c_{\mathrm{Beta}}(1, \gamma) - c_{\mathrm{Beta}}(\hat{\eta}_{k1}, \hat{\eta}_{k0}) \Big) \tag{3.31} \\
&\quad + \sum_{k=1}^{\infty} \Big( N_k(\hat{r}) + 1 - \hat{\eta}_{k1} \Big) \, \mathbb{E}_q[\log u_k] \nonumber \\
&\quad + \sum_{k=1}^{\infty} \Big( N_k^>(\hat{r}) + \gamma - \hat{\eta}_{k0} \Big) \, \mathbb{E}_q[\log(1 - u_k)] \nonumber
\end{align}
Here, the expectations have closed form under the assumed Beta distribution q(u_k | η̂_k):
\begin{align}
\mathbb{E}_q[\log u_k] &\triangleq \psi(\hat{\eta}_{k1}) - \psi(\hat{\eta}_{k1} + \hat{\eta}_{k0}) \tag{3.32} \\
\mathbb{E}_q[\log(1 - u_k)] &\triangleq \psi(\hat{\eta}_{k0}) - \psi(\hat{\eta}_{k1} + \hat{\eta}_{k0}) \tag{3.33}
\end{align}
and the cumulant function of the Beta distribution is
\begin{align}
c_{\mathrm{Beta}}(a_1, a_0) \triangleq \log \Gamma(a_1 + a_0) - \log \Gamma(a_1) - \log \Gamma(a_0) \tag{3.34}
\end{align}
Finally, the summary statistics N_k(r̂) and N_k^>(r̂) for each cluster k ∈ {1, 2, . . . , K, . . .} are defined as:
\begin{align}
N_k(\hat{r}) &\triangleq \sum_{n=1}^{N} \mathbb{E}_{q(z)}[\delta_k(z_n)]
= \begin{cases} \sum_{n=1}^{N} \hat{r}_{nk} & \text{if } k \leq K \\ 0 & \text{if } k > K \end{cases} \tag{3.35} \\
N_k^>(\hat{r}) &\triangleq \sum_{n=1}^{N} \mathbb{E}_{q(z)}\Big[ \sum_{\ell=k+1}^{\infty} \delta_\ell(z_n) \Big]
= \begin{cases} \sum_{\ell=k+1}^{K} N_\ell(\hat{r}) & \text{if } k < K \\ 0 & \text{if } k \geq K \end{cases} \tag{3.36}
\end{align}
We can interpret N_k as the effective count of data observations assigned to cluster label k, and N_k^> as the effective count of observations with a label larger than k. Under the assumed truncation level K, for every inactive cluster label k > K we have r̂_nk = 0, which implies that both counts are also zero: N_k = 0 and N_k^> = 0. Further, at the final active cluster index K, we have N_K^> = 0 as well by definition. Both quantities N_k and N_k^> in Eqs. (3.35) and (3.36) are linear functions of the responsibilities r̂.

3.4 Update steps for variational optimization

We can now formally define the constrained optimization problem for our free parameters τ̂, ν̂, η̂, r̂ given an observed dataset x:
\begin{align}
\max_{\hat{\tau}, \hat{\nu}, \hat{\eta}, \hat{r}} \quad & \mathcal{L}_{\mathrm{data}}(x, \hat{r}, \hat{\tau}, \hat{\nu}) + \mathcal{L}_{\mathrm{entropy}}(\hat{r}) + \mathcal{L}_{\mathrm{DP\text{-}alloc}}(\hat{r}, \hat{\eta}) \tag{3.37} \\
\text{subject to} \quad & \hat{r}_n \geq 0 \text{ and } \textstyle\sum_{k=1}^{K} \hat{r}_{nk} = 1 \quad \text{for } n = 1, 2, \ldots, N \nonumber \\
& \hat{\eta}_k \geq 0 \quad \text{for } k = 1, 2, \ldots \nonumber \\
& \hat{\nu}_k \geq 0 \text{ and } \hat{\tau}_k \in \mathcal{M} \quad \text{for } k = 1, 2, \ldots \nonumber
\end{align}
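To make the truncated summaries concrete, here is a minimal numpy sketch of the active-cluster portions of Eqs. (3.35) and (3.36); the function name `summarize_counts` is our own, not from the dissertation's codebase.

```python
import numpy as np

def summarize_counts(resp):
    """Effective counts N_k and tail counts N_k> from an (N, K) responsibility matrix.

    resp[n, k] is the soft assignment of observation n to active cluster k,
    so each row must be non-negative and sum to one.
    """
    N_k = resp.sum(axis=0)  # Eq. (3.35): expected count per active cluster
    # Eq. (3.36): mass assigned to labels strictly greater than k, via a
    # reverse cumulative sum, shifted so that N_K> = 0 at the last active index.
    N_gt = np.concatenate([np.cumsum(N_k[::-1])[::-1][1:], [0.0]])
    return N_k, N_gt
```

Both outputs are linear in the responsibilities, which is what makes the additive summary-based updates later in this chapter possible.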
Remember that these functions implicitly depend on the DP concentration hyperparameter γ > 0, as well as the observation model prior hyperparameters τ̄, ν̄, though our notation omits them for simplicity. Given the internal structure of the objective, we pursue a block-coordinate ascent algorithm, which proceeds in three steps. Each step updates the free parameters of one factor among q(z), q(u), and q(φ) while holding the other free parameters fixed. When applied iteratively, these steps are guaranteed to monotonically improve the whole objective L until it converges to a fixed point: a local optimum. Below, we define the optimization problem solved by each step of the block-coordinate ascent algorithm; the following sections then derive closed-form solutions to each problem.

Local step: update local assignment responsibilities.
\begin{align}
\max_{\hat{r}} \quad & \mathcal{L}_{\mathrm{data}}(x, \hat{r}, \hat{\tau}, \hat{\nu}) + \mathcal{L}_{\mathrm{entropy}}(\hat{r}) + \mathcal{L}_{\mathrm{DP\text{-}alloc}}(\hat{r}, \hat{\eta}) \tag{3.38} \\
\text{subject to} \quad & \hat{r}_n \geq 0 \text{ and } \textstyle\sum_{k=1}^{K} \hat{r}_{nk} = 1 \quad \text{for } n = 1, 2, \ldots, N \nonumber
\end{align}

Global step: update observation model global parameters.
\begin{align}
\max_{\hat{\tau}, \hat{\nu}} \quad \mathcal{L}_{\mathrm{data}}(x, \hat{r}, \hat{\tau}, \hat{\nu}) \quad \text{subject to } \hat{\nu}_k \geq 0 \text{ and } \hat{\tau}_k \in \mathcal{M} \text{ for } k = 1, 2, \ldots \tag{3.39}
\end{align}

Global step: update allocation model global parameters.
\begin{align}
\max_{\hat{\eta}} \quad \mathcal{L}_{\mathrm{DP\text{-}alloc}}(\hat{r}, \hat{\eta}) \quad \text{subject to } \hat{\eta}_k \geq 0 \text{ for } k = 1, 2, \ldots \tag{3.40}
\end{align}

3.4.1 Global parameter update step

Global step for observation model. As discussed in Sec. 2.1.3, the optimal observation global parameters τ̂_k*, ν̂_k* for every cluster k ∈ {1, 2, . . . , K, . . .} which solve the optimization problem in Eq. (3.39) can be found in closed form:
\begin{align}
\hat{\tau}_k^* &= S_k(x, \hat{r}) + \bar{\tau} \tag{3.41} \\
\hat{\nu}_k^* &= N_k(\hat{r}) + \bar{\nu} \nonumber
\end{align}
These optimal values naturally satisfy the required constraints ν̂_k* ∈ R₊ and τ̂_k* ∈ M, so that ν̂_k* and τ̂_k* remain valid parameters for the density q(φ_k).

Updates for inactive clusters. For every inactive cluster k > K, we have by definition N_k = 0 and S_k = 0. Thus, the optimal points reduce to the values of the prior hyperparameters: τ̂_k* = τ̄ and ν̂_k* = ν̄.
Throughout our optimization, we assume that the infinitely many inactive cluster parameters are set this way. They need not actually be represented explicitly, because they do not influence any of the other variational free parameters or the exact calculation of the ELBO objective function L.

Simplified L_data objective term. Substituting the optimal points τ̂_k*, ν̂_k* into the original objective L_data in Eq. (3.26) for both active and inactive clusters, we see that the last two lines involving the terms N_k + ν̄ − ν̂_k and S_k + τ̄ − τ̂_k evaluate to zero for every cluster index k. This leaves a greatly simplified expression:
\begin{align}
\mathcal{L}_{\mathrm{data}}(x, \hat{r}, \hat{\tau}^*, \hat{\nu}^*) = \sum_{n=1}^{N} h(x_n) + \sum_{k=1}^{K} \Big( c_P(\bar{\tau}, \bar{\nu}) - c_P(\hat{\tau}_k^*, \hat{\nu}_k^*) \Big) \tag{3.42}
\end{align}
The reference measure term remains unchanged, while the sum over the cumulant functions is now only over the K active clusters, not all countably infinite clusters. This occurs because for all inactive clusters, we have τ̂_k* = τ̄ and ν̂_k* = ν̄, and thus the difference of c_P functions at inactive clusters is necessarily zero without any explicit evaluation required.

Global step for allocation model. To find the optimal values η̂* for the optimization problem in Eq. (3.40), we can apply standard constrained optimization techniques like Lagrange multipliers to find the closed-form solution for each cluster k:
\begin{align}
\hat{\eta}_{k1}^* &= N_k(\hat{r}) + 1 \tag{3.43} \\
\hat{\eta}_{k0}^* &= N_k^>(\hat{r}) + \gamma \nonumber
\end{align}
Naturally, these updates preserve the required constraints η̂ ≥ 0.

Updates for inactive clusters. For any inactive cluster k > K, by definition N_k = 0 and N_k^> = 0, which leaves the fixed points equal to the prior hyperparameters: η̂_k1* = 1 and η̂_k0* = γ. Throughout optimization, we may assume the inactive cluster parameters are set this way. They need not be represented in memory.

Simplified L_DP-alloc objective term. Substituting the optimal points η̂* into the function L_DP-alloc in Eq.
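Both closed-form global updates are a one-line addition of summary statistics to prior hyperparameters. A minimal numpy sketch, assuming a vector-valued statistic s(x_n) and scalar-per-cluster counts; the function name `global_step` and its argument layout are our own:

```python
import numpy as np

def global_step(resp, suff_stats, tau_bar, nu_bar, gamma):
    """Closed-form global updates given fixed responsibilities.

    resp       : (N, K) responsibility matrix
    suff_stats : (N, D) per-observation statistics s(x_n)
    Returns the optimal parameters of Eqs. (3.41) and (3.43).
    """
    N_k = resp.sum(axis=0)                # expected counts N_k
    S_k = resp.T @ suff_stats             # (K, D) expected data statistics S_k
    N_gt = np.concatenate([np.cumsum(N_k[::-1])[::-1][1:], [0.0]])  # tail counts N_k>
    tau_hat = S_k + tau_bar               # Eq. (3.41), observation model
    nu_hat = N_k + nu_bar
    eta1_hat = N_k + 1.0                  # Eq. (3.43), allocation model
    eta0_hat = N_gt + gamma
    return tau_hat, nu_hat, eta1_hat, eta0_hat
```

Note that only the summaries (N, S) enter the update, never the raw data, which is the property the scalable algorithms of Sec. 3.5 exploit.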
(3.30), we find that the coefficients N_k + 1 − η̂_k1 and N_k^> + γ − η̂_k0 evaluate to zero, leaving a simplified expression:
\begin{align}
\mathcal{L}_{\mathrm{DP\text{-}alloc}}(\hat{r}, \hat{\eta}^*) = \sum_{k=1}^{K} \Big( c_{\mathrm{Beta}}(1, \gamma) - c_{\mathrm{Beta}}(\hat{\eta}_{k1}^*, \hat{\eta}_{k0}^*) \Big) \tag{3.44}
\end{align}
where again, the difference in cumulant functions for k > K will always evaluate to zero under the optimal settings of η̂*, and thus we need not represent these inactive terms of the sum at all.

3.4.2 Local parameter update step

To solve the local step optimization problem in Eq. (3.38), we first rewrite the whole objective L as a function of the responsibilities at each observation n. We gather the terms that do not depend on the active responsibilities r̂ for clusters k ∈ {1, . . . , K} into a constant term:
\begin{align}
\mathcal{L}(x, \hat{r}, \hat{\eta}, \hat{\tau}, \hat{\nu}) = \mathrm{const}(x, \hat{\eta}, \hat{\tau}, \hat{\nu}) + \sum_{n=1}^{N} \mathcal{L}_n(\hat{r}_n, x_n, \hat{\eta}, \hat{\tau}, \hat{\nu}) \tag{3.45}
\end{align}
Now, the relevant objective function L_n, which captures all terms from L_data, L_entropy, L_DP-alloc relevant to observation n, is given by:
\begin{align}
\mathcal{L}_n(\hat{r}_n, x_n, \hat{\eta}, \hat{\tau}, \hat{\nu}) &\triangleq \sum_{k=1}^{K} \hat{r}_{nk} \Big( W_{nk}(x_n, \hat{\eta}, \hat{\tau}, \hat{\nu}) - \log \hat{r}_{nk} \Big) \tag{3.46} \\
W_{nk}(x_n, \hat{\eta}, \hat{\tau}, \hat{\nu}) &\triangleq \mathbb{E}_{q(\phi_k)}\big[ \log p(x_n \mid \phi_k) \big] + \mathbb{E}_{q(u)}\big[ \log \pi_k(u) \big] \tag{3.47}
\end{align}
Here, we can interpret each W_nk as the log posterior weight that cluster k has on observation n. Larger values indicate that cluster k is more likely to explain observation n. The expectations that define W_nk have closed form when q(φ_k) and q(u_k) come from our chosen exponential family forms. From the decomposition of L into a sum of independent terms for each observation, it is clear that our objective may be optimized independently for each observation n:
\begin{align}
\hat{r}_n^* = \arg\max_{\hat{r}_n} \; \mathcal{L}_n(\hat{r}_n, x_n, \hat{\eta}, \hat{\tau}, \hat{\nu})
\quad \text{subject to } \hat{r}_n \geq 0 \text{ and } \textstyle\sum_{k=1}^{K} \hat{r}_{nk} = 1 \tag{3.48}
\end{align}
Through standard constrained optimization methods, we find the solution for the optimal responsibility vector r̂_n*. We can compute the optimal vector in closed form by setting each entry k ∈ {1, 2, . . . , K} to the exponentiated posterior weight e^{W_nk} and then normalizing:
\begin{align}
\hat{r}_{nk}^* = \frac{e^{W_{nk}}}{\sum_{\ell=1}^{K} e^{W_{n\ell}}}, \tag{3.49}
\end{align}
which by inspection is guaranteed to obey the required constraints that responsibilities must be non-negative and sum to one. By performing the update in Eq. (3.49) at each observation n independently, we compute the optimal responsibilities r̂ for the whole dataset.

3.4.3 Nested truncation w.r.t. number of active clusters

One advantage of our chosen truncation scheme for q(z) is that the resulting simplified family of approximate posterior distributions is nested across truncation levels. This means for any truncation level K and accompanying free parameters Ψ_K = {r̂, η̂, τ̂, ν̂} for active clusters, we can construct a configuration Ψ_{K+1} = {r̂′, η̂′, τ̂′, ν̂′} with truncation level K + 1 with exactly the same score under the objective L. Here is the construction:
\begin{align}
\hat{r}'_{nk} &= \begin{cases} \hat{r}_{nk} & \text{if } k \leq K \\ 0 & \text{if } k = K+1 \end{cases}
\qquad
\hat{\eta}'_{k1} = \begin{cases} \hat{\eta}_{k1} & \text{if } k \leq K \\ 1 & \text{if } k = K+1 \end{cases}
\qquad
\hat{\eta}'_{k0} = \begin{cases} \hat{\eta}_{k0} & \text{if } k \leq K \\ \gamma & \text{if } k = K+1 \end{cases} \tag{3.50} \\
\hat{\tau}'_k &= \begin{cases} \hat{\tau}_k & \text{if } k \leq K \\ \bar{\tau} & \text{if } k = K+1 \end{cases}
\qquad
\hat{\nu}'_k = \begin{cases} \hat{\nu}_k & \text{if } k \leq K \\ \bar{\nu} & \text{if } k = K+1 \end{cases} \nonumber
\end{align}
Using a variational family with nested truncation has several benefits. First, it is easy to compare alternative approximate posteriors with different values of K: whichever has the larger score under the objective function L is the superior model. Second, we can easily construct candidate local and global parameters that differ from some current assignments r̂ by adding or removing one cluster. Both these properties are used in our new proposal moves that adapt the truncation K, which we develop in Ch. 4.

Previous work on truncation. Our chosen nested truncation was previously suggested for topic models (Teh et al., 2008; Bryant and Sudderth, 2012), and has been more recently applied to some hidden Markov models (Johnson and Willsky, 2014). One well-known alternative is direct truncation of the global parameters of the stick-breaking process (Blei and Jordan, 2006).
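The nested construction in Eq. (3.50) is mechanical to implement: the expanded cluster receives zero responsibility and prior-valued global parameters. A small numpy sketch, where the function name and argument layout are our own:

```python
import numpy as np

def expand_truncation(resp, eta1, eta0, tau, nu, tau_bar, nu_bar, gamma):
    """Expand a K-cluster configuration to K+1 clusters per Eq. (3.50).

    The new cluster gets zero responsibility and prior hyperparameter values,
    so the objective L is unchanged by construction.
    """
    resp2 = np.hstack([resp, np.zeros((resp.shape[0], 1))])  # r'_{n,K+1} = 0
    eta1_2 = np.append(eta1, 1.0)                            # prior Beta(1, gamma)
    eta0_2 = np.append(eta0, gamma)
    tau2 = np.append(tau, tau_bar)                           # prior observation params
    nu2 = np.append(nu, nu_bar)
    return resp2, eta1_2, eta0_2, tau2, nu2
```

This sketch assumes scalar per-cluster observation parameters for brevity; the vector-valued case stacks rows instead of appending scalars.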
This choice enforces a more restrictive assumption about the last active index K: E_q[u_K] ≜ 1 − Σ_{k=1}^{K−1} E_q[π_k]. In other words, the entire unit probability mass must be allocated to the first K clusters, leaving zero probability for inactive clusters despite the DP prior always placing small positive probability there. In addition to artificially inflating the final active cluster, this assumption leads to more complicated updates for η̂ when the truncation level is changed. Our nested truncation is also more efficient and broadly applicable than the truncation suggested by Kurihara et al. (2006), which explicitly enforces that the aggregate probability mass assigned to inactive topics matches its prior value.

Algorithm 3.1 Variational coordinate ascent for DP mixture models
Input:
  x : dataset
  K : truncation level
  {τ̂_k, ν̂_k}_{k=1}^K : initial global parameters of observation model
  {η̂_k}_{k=1}^K : initial global parameters of allocation model
  γ : allocation model hyperparameter
  τ̄, ν̄ : observation model prior hyperparameters
Output:
  {τ̂_k, ν̂_k}_{k=1}^K : updated global parameters of observation model
  {η̂_k}_{k=1}^K : updated global parameters of allocation model

 1: function VariationalCoordAscentForDPMix(x, K, η̂, τ̂, ν̂)
 2:   while not converged do
 3:     for n ∈ 1, 2, . . . N do                         ▷ Local step
 4:       for k ∈ 1, 2, . . . K do
 5:         C_nk ← E_q[log p(x_n | φ_k)]
 6:         W_nk ← C_nk + E_q[log π_k(u)]
 7:       r̂_n ← RespFromWeights(W_n)
 8:     for k ∈ 1, 2, . . . K do                         ▷ Summary step
 9:       S_k ← Σ_{n=1}^N r̂_nk s(x_n)
10:       N_k ← Σ_{n=1}^N r̂_nk
11:       N_k^> ← Σ_{ℓ=k+1}^K N_ℓ
12:     for k ∈ 1, 2, . . . K do
13:       τ̂_k ← S_k + τ̄                                 ▷ Global step for observation parameters
14:       ν̂_k ← N_k + ν̄
15:       η̂_k1 ← N_k + 1                                ▷ Global step for allocation parameters
16:       η̂_k0 ← N_k^> + γ
17:   return η̂, τ̂, ν̂

Block coordinate ascent algorithm for approximate posterior inference for the DP mixture model. Each converged solution represents a local optimum of the optimization problem in Eq. (3.37).
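To show the full local/summary/global cycle end to end, here is a runnable toy version of the coordinate ascent loop for one specific likelihood, unit-variance 1-D Gaussians with unknown means. This is our simplifying assumption for illustration, not the general exponential-family treatment of Alg. 3.1, and the function name is ours.

```python
import numpy as np
from scipy.special import digamma

def coord_ascent_dpmix(x, K, gamma=1.0, tau_bar=0.0, nu_bar=1.0, n_iters=50):
    """Toy full-dataset coordinate ascent for a DP mixture of unit-variance
    1-D Gaussians, so q(phi_k) = Normal(tau_k/nu_k, 1/nu_k)."""
    N = len(x)
    rng = np.random.default_rng(0)
    resp = rng.dirichlet(np.ones(K), size=N)       # random initial responsibilities
    for _ in range(n_iters):
        # Summary step: expected counts and data statistics (Eqs. 3.35-3.36).
        N_k = resp.sum(axis=0)
        S_k = resp.T @ x
        N_gt = np.concatenate([np.cumsum(N_k[::-1])[::-1][1:], [0.0]])
        # Global step (Eqs. 3.41 and 3.43).
        tau_hat, nu_hat = S_k + tau_bar, N_k + nu_bar
        eta1, eta0 = N_k + 1.0, N_gt + gamma
        # Local step: log posterior weights W_nk (Eq. 3.47), then Eq. (3.49).
        m_k = tau_hat / nu_hat                     # posterior mean of phi_k
        E_log_lik = -0.5 * (x[:, None] - m_k) ** 2 - 0.5 / nu_hat  # up to a constant
        E_log_u = digamma(eta1) - digamma(eta1 + eta0)
        E_log_1mu = digamma(eta0) - digamma(eta1 + eta0)
        E_log_pi = E_log_u + np.concatenate([[0.0], np.cumsum(E_log_1mu[:-1])])
        W = E_log_lik + E_log_pi
        W -= W.max(axis=1, keepdims=True)          # log-sum-exp stabilization
        resp = np.exp(W)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp, m_k
```

On well-separated data, the loop quickly concentrates responsibility so that each true group shares one cluster, with the remaining initial clusters left nearly empty.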
3.5 Algorithms for variational optimization

3.5.1 Full-dataset block coordinate ascent algorithm

Alg. 3.1 organizes the block-coordinate ascent update steps derived in the previous section into a coherent algorithm. Given some initial global parameters and a dataset, it iteratively cycles between the local step for q(z_n) in Eq. (3.48) and the global parameter updates for the observation model in Eq. (3.41) and the allocation model in Eq. (3.43).

Summary statistics for global updates

Taking a step back, we emphasize the fundamental roles that the summary statistics N and S play in Alg. 3.1. First, they are sufficient statistics for the global updates. This means that given N and S, we have everything we need to compute the optimal global parameters for both the allocation model and the observation model. Second, these quantities N and S share two key properties: scalability and additivity.

First, a summary is scalable if its size is independent of the size of the dataset, measured by the number of documents D or atoms N. Practically, this means it is cheap to store and manipulate these quantities, even as the dataset grows toward infinite size. Performing any global update will always be a constant-time operation with respect to the number of documents or atoms, once the summary statistics are computed.

Second, we say a summary is additive if it is computable by summing over independent terms, where each term comes from either one data atom or one document (group). Additivity means that if we want to summarize a large dataset x with N total atoms, we could first divide it into two disjoint pieces x_1 and x_2, with N_1 and N_2 atoms each, satisfying N_1 + N_2 = N. Given summaries S_1k and S_2k for the separate pieces, we can compute the summary for the whole dataset simply: S_k^G = S_1k + S_2k. Throughout, we will use the superscript G notation for whole-dataset summaries, like N^G or S^G, to differentiate these from summaries for subsets of the data.
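The additivity property is easy to verify numerically. A small self-contained sketch, where the variable names are ours:

```python
import numpy as np

# Random soft assignments for N = 10 atoms across K = 3 clusters,
# plus a 2-dimensional statistic s(x_n) per atom.
rng = np.random.default_rng(0)
resp = rng.dirichlet(np.ones(3), size=10)
stats = rng.normal(size=(10, 2))

# Whole-dataset summaries N^G and S^G.
N_G = resp.sum(axis=0)
S_G = resp.T @ stats

# Summarize two disjoint batches independently, then add.
N_1, S_1 = resp[:4].sum(axis=0), resp[:4].T @ stats[:4]
N_2, S_2 = resp[4:].sum(axis=0), resp[4:].T @ stats[4:]

# Additivity: batch summaries combine exactly into whole-dataset summaries.
assert np.allclose(N_1 + N_2, N_G)
assert np.allclose(S_1 + S_2, S_G)
```

Exactness here, rather than approximation, is what later lets the memoized algorithm make exact accept/reject decisions from per-batch summaries alone.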
When summaries are additive, both parallel and online processing become possible, allowing variational inference for DP mixture models and extensions to scale to large datasets.

Summaries for computing the objective L

For all observation models, the observation-model objective L_data is a purely linear function of the summaries N and S, as defined in Eq. (3.26). In contrast, the allocation-model objective L_alloc for DP mixture models contains a non-linear function of r̂: the entropy term L_entropy(r̂) in Eq. (3.29). We can create a summary statistic H, a vector with an entry for each active cluster, for calculating this entropy:
\begin{align}
H_k(\hat{r}) \triangleq - \sum_{n=1}^{N} \hat{r}_{nk} \log \hat{r}_{nk} \tag{3.51}
\end{align}
Then, we simply compute L_entropy = Σ_{k=1}^K H_k. The summary H has the properties of scalable size and additivity, just like N and S. Given a set of possible assignment responsibilities r̂ for data x, we can compute the summaries {S, N, H}, discard r̂ afterwards, and have everything we need to find the optimal global parameters and compute the resulting objective score L.

Towards scalable algorithms

The variational optimization algorithm in Alg. 3.1 is sometimes called a full-dataset algorithm or a batch algorithm, meaning it performs a local step for every observation in the dataset before each global step update. If the number of observations N in the dataset is large, for example in the hundreds of thousands or more, this algorithm will be slow to propagate information between the local updates and the global parameters. A natural idea is to perform global updates at more frequent intervals, after processing a small subset or batch of data.
In the sections below, we describe two common methods for scaling variational coordinate ascent to large datasets: stochastic variational inference and memoized variational inference.

Algorithm 3.2 Stochastic variational coordinate ascent for DP mixture models
Input:
  x : dataset
  K : truncation level
  {τ̂_k, ν̂_k}_{k=1}^K : initial global parameters of observation model
  {η̂_k}_{k=1}^K : initial global parameters of allocation model
  γ : allocation model hyperparameter
  τ̄, ν̄ : observation model prior hyperparameters
  N_b : integer, desired size of each minibatch
  δ : positive real, "delay" in learning rate schedule
  κ : value in (0.5, 1.0], controls rate of learning rate decay
Output:
  {τ̂_k, ν̂_k}_{k=1}^K : updated global parameters of observation model
  {η̂_k}_{k=1}^K : updated global parameters of allocation model

 1: function StochasticVariationalForDPMix(x, K, η̂, τ̂, ν̂)
 2:   for iteration t ∈ 1, 2, . . . do
 3:     D_t ← SampleWithoutReplacement({1, 2, . . . N}, N_b)   ▷ Sample current minibatch
 4:     for n ∈ D_t do                                          ▷ Local step
 5:       for k ∈ 1, 2, . . . K do
 6:         C_nk ← E_q[log p(x_n | φ_k)]
 7:         W_nk ← C_nk + E_q[log π_k(u)]
 8:       r̂_n ← RespFromWeights(W_n)
 9:     for k ∈ 1, 2, . . . K do                                ▷ Summary step
10:       S_tk ← Σ_{n ∈ D_t} r̂_nk s(x_n)
11:       N_tk ← Σ_{n ∈ D_t} r̂_nk
12:     ξ_t ← (δ + t)^{−κ}                                      ▷ Update step size
13:     for k ∈ 1, 2, . . . K do
14:       τ̂_k ← (1 − ξ_t) τ̂_k + ξ_t (τ̄ + (N/N_b) S_tk)        ▷ Global step for observation parameters
15:       ν̂_k ← (1 − ξ_t) ν̂_k + ξ_t (ν̄ + (N/N_b) N_tk)
16:       η̂_k1 ← (1 − ξ_t) η̂_k1 + ξ_t (1 + (N/N_b) N_tk)      ▷ Global step for allocation parameters
17:       η̂_k0 ← (1 − ξ_t) η̂_k0 + ξ_t (γ + (N/N_b) Σ_{ℓ=k+1}^K N_tℓ)
18:   return η̂, τ̂, ν̂

Stochastic block coordinate ascent algorithm for solving the optimization problem in Eq. (3.37). At each iteration, the algorithm first samples a small batch of documents and then executes local and global optimization steps. The global step differs from the full-dataset algorithm by involving a learning rate ξ_t and a natural gradient step.
3.5.2 Stochastic variational inference

Stochastic variational inference (SVI) was first introduced for topic models by Hoffman et al. (2010) and later described more comprehensively in Hoffman et al. (2013).

Figure 3.2: Illustration of possible learning rate schedules for stochastic variational inference, plotting ξ_t against update step t for κ ∈ {0.55, 0.75, 0.95} and delay δ ∈ {1, 10, 100}. We set the learning rate ξ_t at update step t to be ξ_t = (δ + t)^{−κ}, where the delay parameter δ ≥ 0 downweights early iterations at larger values, and the decay parameter κ determines how fast the rate decays to zero. To obtain the Robbins-Monro guarantees, we require that κ ∈ (0.5, 1].

SVI scales to a large dataset of N observations by performing two conceptual steps at each iteration t. First, a local step finds the optimal responsibilities for a small subset or batch of data. Second, a global step stochastically updates the global parameters to optimize L via a noisy gradient step computed using only the current batch. Each iteration's batch of data is sampled uniformly at random from the whole dataset. The complete algorithm for our DP mixture model optimization problem is specified in Alg. 3.2, and the relevant steps are detailed below. The usual formulation assumes that the whole dataset has a finite total number of atoms N that is known in advance, but later work has extended this to never-ending data streams (Theis and Hoffman, 2015).

SVI local step. The local step of SVI is unchanged from the full-dataset algorithm, except that it only processes a subset of data at every iteration. Denote the subset of data indices selected at iteration t as D_t ⊂ {1, 2, . . . , N}. This set is chosen uniformly at random without replacement at every iteration.
The size of this set, which we denote N_b = |D_t|, must be pre-specified and obey the constraint 1 ≤ N_b ≤ N. Once the set D_t is chosen, the local step update occurs: for every data index n in the current batch, we compute the optimal responsibilities r̂_n* for the approximate posterior factor q(z_n) via Eq. (3.48). We emphasize that these updates exactly optimize the single-observation objective L_n. Aggregating over all observations in the set D_t, we can produce summaries for the effective counts N_tk and effective data statistics S_tk for each active cluster in the batch at iteration t.

SVI global step. The global step update for SVI is noticeably different from that in Alg. 3.1. As a stochastic gradient algorithm, the value of a global parameter at iteration t depends on the value at iteration t − 1. For example, global parameter τ̂_k will start at some initial value τ̂_k^1 and then proceed through a sequence of values τ̂_k^2, . . . , τ̂_k^t, τ̂_k^{t+1}, . . . as more batches are seen. The value at iteration t depends on a learning rate ξ_t, a positive scalar that decays over iterations. The larger ξ_t is, the more "forgetful" the algorithm is of previous iterations.

For global variables with exponential family conditionals, Hoffman et al. (2013) recommend using the natural gradient global step at each batch. This is the gradient under the natural parameterization, not the mean parameterization, of the relevant exponential family density. Natural gradients yield closed-form computations that share much in common with full-dataset updates. For each global parameter, the natural gradient update has three steps, which we walk through below for the data shape parameter τ̂_k. First, create a noisy estimate of the full-dataset summary statistic S_k ≈ (N/N_b) S_tk. Here, the amplification factor N/N_b ≥ 1 scales up the current batch's statistic to have the effective size of the full dataset.
Next, use this estimated summary to perform a global step, creating a batch-specific global parameter τ̂_tk ≈ τ̄ + (N/N_b) S_tk. Finally, interpolate between the batch-optimal value and the previous global value with the current learning rate ξ_t:
\begin{align}
\hat{\tau}_k \leftarrow \xi_t \Big( \bar{\tau} + \frac{N}{N_b} S_{tk} \Big) + (1 - \xi_t) \, \hat{\tau}_k \tag{3.52}
\end{align}
When the learning rate decays according to the Robbins-Monro conditions, the stochastic variational algorithm provably converges to a local optimum of the whole-dataset objective function L (Hoffman et al., 2013). Hoffman et al. (2013) suggest defining the learning rate decay schedule as ξ_t = (δ + t)^{−κ}, where the parameter δ ≥ 0 acts as a delay whose larger values downweight the impact of early iterations, and the parameter κ ∈ (0.5, 1] is a decay factor that determines how fast the step size ξ_t approaches zero as t increases. See Fig. 3.2 for a visualization of how the learning rate ξ_t decays as the number of iterations t grows for a variety of possible decay schedules.

SVI has several advantages over full-dataset inference, including much faster propagation of information between local and global updates and the potential for random gradient steps to help escape very shallow local optima that might trap the full-dataset method. However, performance is extremely sensitive to how aggressively the step size ξ_t decays as more updates occur. The decay schedule often has nuisance parameters that must be tuned for ideal performance, which can be prohibitively expensive. Performance can also be sensitive to the chosen batch size N_b. When conditional distributions do not come from the exponential family, we may still perform stochastic updates by computing the gradient of the objective L with respect to the global parameter of interest. However, this can require costly derivation effort by the practitioner.
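The three-step natural-gradient update can be written compactly. A sketch for a scalar parameter, where the function name and default schedule values are ours:

```python
import numpy as np

def svi_global_step(tau_hat, tau_bar, S_batch, N_total, N_batch, t,
                    delay=1.0, kappa=0.6):
    """One noisy natural-gradient update for tau_hat, following Eq. (3.52).

    S_batch is the batch summary S_tk; N_total / N_batch is the amplification
    factor; delay and kappa set the Robbins-Monro learning rate schedule."""
    xi_t = (delay + t) ** (-kappa)                        # learning rate at step t
    tau_batch = tau_bar + (N_total / N_batch) * S_batch   # amplified batch-optimal value
    return (1.0 - xi_t) * tau_hat + xi_t * tau_batch      # interpolate old and new
```

Note the update is a convex combination: if the batch-optimal value already equals the current parameter, the parameter is unchanged regardless of the learning rate.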
3.5.3 Memoized variational inference

Memoized variational inference (MVI) is an alternative to SVI that is just as scalable but avoids the nuisance of learning rates and derivation complexities altogether. MVI was first introduced in Hughes and Sudderth (2013) and inspired by the incremental expectation-maximization algorithm of Neal and Hinton (1998). With just one batch, MVI is exactly equal to whole-dataset inference. However, if the dataset is divided into B smaller batches, MVI can be much more scalable than the full-dataset algorithm.

Like SVI, each update iteration consists of a local step at a single batch followed immediately by a global step. Unlike SVI, each global update uses aggregated summaries that represent the whole dataset, not just the current batch. These whole-dataset summaries are incrementally updated at each batch. With a modest cost of additional memory for tracking the batch-specific summaries that enable these incremental updates, we can perform noise-free global steps that exactly optimize the full-dataset objective L.

Algorithm 3.3 Memoized variational coordinate ascent for DP mixture models
Input:
  x : dataset, divided into B fixed batches
  K : truncation level
  {τ̂_k, ν̂_k}_{k=1}^K : initial global parameters of observation model
  {η̂_k}_{k=1}^K : initial global parameters of allocation model
  γ : allocation model hyperparameter
  τ̄, ν̄ : observation model prior hyperparameters
Output:
  {τ̂_k, ν̂_k}_{k=1}^K : updated global parameters of observation model
  {η̂_k}_{k=1}^K : updated global parameters of allocation model

 1: function MemoizedVariationalForDPMix(x, K, η̂, τ̂, ν̂)
 2:   N^G, S^G ← 0                                   ▷ Initialize whole-dataset statistics
 3:   for batch b ∈ 1, 2, . . . B do
 4:     N_b, S_b ← 0                                 ▷ Initialize memoized batch statistics
 5:   for lap ℓ ∈ 1, 2, . . . L_max do
 6:     for batch b ∈ Shuffle(1, 2, . . . B) do
 7:       for n ∈ D_b do                             ▷ Local step for batch b
 8:         for k ∈ 1, 2, . . . K do
 9:           C_nk ← E_q[log p(x_n | φ_k)]
10:           W_nk ← C_nk + E_q[log π_k(u)]
11:         r̂_n ← RespFromWeights(W_n)
12:       S^G ← S^G − S_b                            ▷ Decrement previous batch statistics
13:       N^G ← N^G − N_b
14:       for k ∈ 1, 2, . . . K do                   ▷ Summary step for batch b
15:         S_bk ← Σ_{n ∈ D_b} r̂_nk s(x_n)
16:         N_bk ← Σ_{n ∈ D_b} r̂_nk
17:       S^G ← S^G + S_b                            ▷ Increment new batch statistics
18:       N^G ← N^G + N_b
19:       for k ∈ 1, 2, . . . K do
20:         τ̂_k ← S_k^G + τ̄                         ▷ Global step for observation parameters
21:         ν̂_k ← N_k^G + ν̄
22:         η̂_k1 ← 1 + N_k^G                        ▷ Global step for allocation parameters
23:         η̂_k0 ← γ + Σ_{ℓ=k+1}^K N_ℓ^G
24:   return η̂, τ̂, ν̂

Block coordinate ascent algorithm for solving the optimization problem in Eq. (3.37). Assumes data is divided into a fixed set of B batches before processing begins. At each batch, we perform a local optimization step and a global optimization step. For the DP mixture model, this algorithm is guaranteed to monotonically increase the ELBO objective function L after every step once the first lap through the data is completed.

The target application of MVI is datasets of large but finite size through which we can afford to make dozens but perhaps not thousands of update iterations. Variants of MVI easily generalize to streaming settings, but our experiments show that when it is affordable to do so, making dozens of repeated passes through a dataset yields substantially better results.

MVI for fixed-size datasets. Given a fixed dataset of D total documents, we must divide it into B distinct batches before memoized inference can begin. Any assignment of documents to batches may be used; random assignment is effective but not necessary. We only require that the same batches are used throughout the algorithm, as we make repeated passes through the dataset. We denote the set of atom indices n assigned to batch b as D_b. Alg. 3.3 provides the formal steps of the standard memoized training algorithm for our DP mixture model.
The algorithm completes many passes through the complete dataset, processing one batch at a time. We call each pass through the dataset a lap. In each lap, we visit each batch exactly once. Any ordering is allowed so long as all batches are visited eventually. As a default, we recommend a different random order for each lap.

When visiting batch b, MVI first performs a local step to obtain batch-specific summaries N_b, S_b, H_b. These fresh values are then used to incrementally update whole-dataset summaries N^G, S^G, H^G, where again the superscript G indicates the summary value is a global aggregation of all batch values. During the first visit to batch b, each global summary is simply incremented: N_k^G ← N_k^G + N_bk. On later visits, each global summary update requires adding the new batch value and subtracting the previous, memoized batch-specific value: N_k^G ← N_k^G + N_bk − N_bk^memory. After every incremental update of a global summary, we overwrite the memoized value for batch b: N_bk^memory ← N_bk.

After each batch-specific local step and resulting incremental summary step, the global summaries N^G, S^G are fully consistent with the most recent assignments to all batches. We may thus use these summaries in a global step to obtain optimal global parameters, just as in the standard full-dataset algorithm. Overall, memoized inference has the same per-batch runtime complexity as stochastic variational inference. However, it does have a modest additional storage cost that grows linearly with the number of batches B and number of clusters K, that is O(BK), due to required storage for N_bk^memory and other memoized batch summaries. Importantly, this storage requirement does not need to scale directly with the number of documents or data atoms, only the number of batches.
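The subtract-old, add-new bookkeeping for one summary can be sketched as follows; the class name and layout are ours, a minimal illustration rather than the dissertation's implementation:

```python
import numpy as np

class MemoizedSummary:
    """Whole-dataset summary N^G maintained by subtract-old / add-new updates.

    Stores one memoized copy of each batch's most recent summary, so that
    revisiting a batch keeps N^G exactly consistent with the latest
    assignments everywhere. Memory grows with the number of batches, not atoms.
    """

    def __init__(self, K):
        self.N_G = np.zeros(K)
        self.memory = {}  # batch id -> last summary recorded for that batch

    def update_batch(self, b, N_b):
        self.N_G -= self.memory.get(b, 0.0)   # decrement stale value (0 on first visit)
        self.N_G += N_b                        # increment fresh batch summary
        self.memory[b] = np.array(N_b, dtype=float)  # memoize a copy for next visit
```

After any sequence of batch visits, `N_G` equals the sum of the most recently computed summary for every visited batch, which is exactly the consistency property the monotonicity guarantee relies on.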
Memoized inference is advantageous because each global step delivers global parameters that are optimal under the whole-dataset objective, avoiding the noisiness inherent in stochastic updates and the complexities of selecting a learning rate. Furthermore, during MVI we can exactly evaluate the whole-dataset objective L at any point after the first complete lap. This allows the design of birth and merge proposal moves that are checked exactly against the whole-dataset objective, and only accepted if they improve the L score. This guaranteed improvement makes our merge moves more reliable than earlier work that made decisions about removing components based on a noisy objective estimate from only a single batch (Bryant and Sudderth, 2012).

Memoized algorithm and monotonicity guarantees

For the DP mixture model, the memoized algorithm is guaranteed to monotonically improve the objective L at every iteration. Crucially, this can be done without storing the complete responsibilities r̂ for the full dataset, instead computing the per-batch responsibilities r̂_b on demand. The reason is that the local step objective for r̂_b in Eq. (3.38) is convex and therefore has one universal optimum given fixed global parameters τ̂, ν̂, and η̂. The guaranteed increase of the objective function L provides a useful practical test to verify the correctness of any implementation: any implementation which yields a non-monotonically increasing L trace must have a bug. No such simple test exists for stochastic variational methods.

Memoized algorithm for streaming

It is straightforward to apply the ideas behind memoized inference to applications that require learning from a never-ending stream of data.
As more data arrives, we may simply visit each never-before-seen batch b, execute a local step that outputs batch-specific summaries like N_bk, and use an incremental update N_k^G ← N_k^G + N_bk to obtain whole-dataset statistics exactly consistent with all data seen thus far. The aggregated summaries can be used in a standard global step to deliver global parameters that are optimal given the local assignments from all previously visited batches. This procedure is equivalent to a synchronous version of the streaming variational inference presented in Alg. 3 of Broderick et al. (2013). Broderick et al. (2013) also present a version of the algorithm which iterates at each batch b through many local-then-global steps until convergence. If a batch is never visited more than once, no memory for per-batch summaries is needed.

3.5.4 Initialization of global parameters

Our iterative optimization algorithm requires an initialization of the global variational parameters for the observation model ν̂, τ̂ and allocation model η̂. Any initialization that provides parameter values in the correct domain is valid, but some are better than others, as later experiments will show. To initialize all variational training algorithms for DP mixtures, we recommend using the Bregman k-means++ procedure from Alg. 2.1. This algorithm applies to the entire class of exponential-family observation models and yields solutions for which some formal approximation ratios have been proved for simpler but related objectives. Many alternatives for initializing these global cluster parameters exist. For example, future work could investigate using a small-variance asymptotics algorithm like DP-means, which can also handle all exponential family (EF) observation models (Jiang et al., 2012).

Initializing MVI. Initializing MVI requires specifying global parameter vectors ν̂, τ̂, and η̂. These are all we need to perform a local step at the first batch.
However, during the first lap each global step uses summaries aggregated only from visited data batches, not the entire dataset. For example, after the first batch the summaries provided to the global step represent only $|\mathcal{D}_1|$ observations, not all N observations. Consequently, after seeing only one batch, the global step is optimizing the global parameters only for the terms in the objective related to the contents of one batch, not the full dataset. Every global step update during the first lap optimizes a different, but related, objective. This evolving objective converges to the full-dataset objective after the completion of the first lap. Consider the case of trying to initialize MVI using a smart, data-informed guess for global parameters. Perhaps this guess comes from running MVI on a related dataset, or uses some hypothesized "ground-truth" parameters. The data in the first few batches may not be completely representative of the clusters present in this initialization. Not every batch will contain many examples of useful but rare clusters. Thus, naively employing the global step during the first few batches can yield new parameters that are quite far from our smart initial guess. If some initial cluster that is useful for later batches does not appear in the first batch, it will be completely erased after batch one and never recovered by later batches. The greedy nature of the first few global steps of MVI can yield parameters that underperform on the full-dataset objective when unseen batches are taken into account. We first and foremost hope that our birth proposal moves fix this problem by creating new, useful clusters as needed. However, if birth moves are unavailable and an informative initialization is used, we recommend delaying any global steps until sufficiently many batches are seen, perhaps even the full dataset if it is small enough.
Until the total number of units processed exceeds a user-specified threshold, at each batch we perform local steps and increment global summaries, but simply keep the global parameters at their initial values. When proposal moves are unavailable, we find this delay strategy can often dramatically improve performance.

3.6 Experimental results

In this section, we demonstrate the basic behavior of our full-dataset variational training algorithm on several toy datasets. Our goal is to help the reader understand how performance varies with initialization, with the number of initial clusters K, and with the choice of the observation model ("likelihood"). A specific take-away from all the experiments below should be that these algorithms are highly vulnerable to local optima, even when using data-driven initialization techniques and large values for the number of initial clusters K. We hope this motivates the exploration of proposal moves in Ch. 4 that make large changes by adding or removing clusters. Several experiments with the scalable memoized or stochastic algorithms are available in the experiments section of Ch. 4, where we explore how these training methods perform with and without our new proposal moves.

3.6.1 Clustering image patches with zero-mean Gaussian likelihoods

Motivated by the success of Gaussian mixture models for image patches by Zoran and Weiss (2011), as illustrated earlier in Fig. 1.2, we consider training DP mixture models with zero-mean Gaussian likelihoods.
[Figure 3.3 panels: train ELBO vs. number of initial clusters K, plus number of effective clusters and train ELBO vs. training time (sec), for Kinit in {40, 80, 150, 200} under random and Bregman++ initializations; best, median, and worst of 25 runs shown.]

Figure 3.3: Local optima found when clustering toy alphabet images with DP mixture models. Dataset: 50000 synthetic images generated from 26 true clusters (one for each letter of the English alphabet: A, B, C, ..., Z). We compare initializations for the variational inference algorithm in Alg. 3.1 and expose local optima problems. Left column: random initialization, which chooses K initial images uniformly at random without replacement to initialize K clusters. Right column: Bregman k-means++ initialization, which chooses K initial images as cluster centers via Alg. 2.2. Bottom row: best, median, and worst of 25 runs, as ranked by L, from K = 100 initial clusters.

First, we consider a synthetic dataset with a set of K = 26 true clusters, each one generating 8x8 pixel image patches showing a single letter of the English alphabet ("A", "B", "C", etc.).
Each true cluster is defined by a covariance matrix $\phi_k$ over the image patch space, as well as a frequency $\pi_k^G$, which we set to be monotonically decreasing with a letter's index in the alphabet, so A is most common and Z is least common. Each successive letter has 95% of the probability mass of the previous letter. After randomly generating N = 50,000 distinct observations from this model, we observed 3695 "A" patches, 3272 "B" patches, and 4929 "C" patches. Among the least common letters, we observed 955 "X" patches, 771 "Y" patches, and 529 "Z" patches. Fig. 3.3 shows the results of extensive local optima experiments on the task of clustering this synthetic dataset. We consider several factors governing the initialization: the number of initial clusters and the cluster selection procedure. As a baseline, we consider a random example initialization, where K distinct data observations are selected uniformly at random without replacement from the index set $\{1, 2, \ldots, N\}$, and these K choices are used to initialize separate clusters. In contrast, we consider the Bregman k-means++ procedure in Alg. 2.2, which selects K observations in a data-driven fashion that tends to find K observations far from each other under the Bregman divergence that comes from our chosen zero-mean Gaussian likelihood. Our conclusions are listed below:

Rich-get-richer prior bias lets some initial clusters decay to zero usage. To understand what happens here, in the second row of Fig. 3.3 we plot the number of effective clusters for each method, defined as the number of active clusters assigned to at least 1 data atom across the entire dataset: $K_{\text{eff}} = |\{k : N_k(\hat{r}) \geq 1\}|$. We see that even with over 100 initial clusters actively used after the first few steps, the updates of Alg. 3.1 enforce the rich-get-richer prior bias of the Dirichlet process prior and gradually favor only a small subset of the total number of initial clusters while letting the rest decay to zero usage.
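The effective-cluster count plotted in Fig. 3.3 can be computed directly from the per-cluster expected counts. A minimal numpy sketch (variable names hypothetical, not the thesis code):

```python
# Sketch: count "effective" clusters, i.e. those assigned at least
# min_count expected data atoms under the soft responsibilities.
import numpy as np

def num_effective_clusters(resp, min_count=1.0):
    """resp: N x K responsibility matrix with rows summing to one."""
    N_k = resp.sum(axis=0)          # expected count per cluster, N_k(r_hat)
    return int(np.sum(N_k >= min_count))
```

Clusters whose expected count falls below the threshold are exactly those whose usage has decayed toward zero under the rich-get-richer updates.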
The original number of clusters is still actively represented in memory, just with some parameters set for zero usage.

Initializations with near the true number of clusters rarely do well. Using many more clusters is consistently better. Across all initializations, using a large number of initial clusters K > 100 relative to the true number of clusters K = 26 leads to much better ELBO scores than, say, K = 40 or K = 60. For an initialization with 30 clusters to do well at capturing the ground-truth trends, it must adequately represent all 26 true clusters. This requires a very lucky selection of examples for the initialization, even for the data-driven Bregman++ procedure. We see in practice that until we have over 100 clusters, we do not reliably recover the true clusters even in the best of 25 initializations. Using a large initialization hugely increases the chances that an initial cluster representing each of the true clusters will be selected.

The number of initial clusters matters more than the initialization procedure. Comparing the local optima plots (top row) for both random initialization and Bregman k-means++ initialization, we see the same dominant trend in both plots: using a large initial number of clusters leads to reliably good performance. Any differences between these procedures are of secondary importance.

Bregman k-means++ initialization shows modest improvement over random initialization. Consider the top row of Fig. 3.3, which shows the final training ELBO value found from each of 25 independent initializations across many different initial numbers of clusters. Looking closely at K = 40 or K = 80, we see that the Bregman++ initialization has much lower variance than the random example initialization, consistently yielding larger ELBO values. We might have expected a larger difference, but the data-driven method does indeed offer visibly better performance.
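As context for the two selection procedures compared above, a k-means++-style seeding loop with a pluggable divergence can be sketched as follows. This is a simplified illustration, not Alg. 2.2 itself; the squared Euclidean distance is just a placeholder for the Bregman divergence induced by the chosen likelihood.

```python
# Sketch: D^2-style seeding with a pluggable divergence function.
# The default squared-Euclidean divergence is a stand-in; a faithful
# Bregman k-means++ would derive the divergence from the likelihood.
import numpy as np

def plusplus_seeds(X, K, divergence=None, rng=None):
    rng = np.random.default_rng(rng)
    if divergence is None:
        divergence = lambda X, c: np.sum((X - c) ** 2, axis=1)
    chosen = [rng.integers(len(X))]          # first seed uniformly at random
    dist = divergence(X, X[chosen[0]])
    for _ in range(K - 1):
        # sample the next seed with probability proportional to its
        # divergence from the closest already-chosen seed
        probs = dist / dist.sum()
        nxt = rng.choice(len(X), p=probs)
        chosen.append(nxt)
        dist = np.minimum(dist, divergence(X, X[nxt]))
    return X[chosen]
```

The proportional sampling is what makes the chosen seeds tend to lie far from one another, which is the property the experiments above credit for the reduced variance of Bregman++ runs.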
3.6.2 Clustering documents with multinomial likelihoods

We now consider the task of clustering data observations which represent entire text documents. Using the DP mixture model, we might hope to learn informative single-membership partitions of news articles which might be post-hoc interpreted as "political news" or "economics" or "film." Here, we take each observation $x_n$ to be a bag-of-words vector. That is, $x_n \in \mathbb{R}^V$ is a vector of non-negative integer word counts from a predefined vocabulary. We use a multinomial likelihood model with a corresponding Dirichlet prior.

Toy Bars experiments

We first consider a "toy bars" experiment, inspired by Griffiths and Steyvers (2004). We arrange the 900-word vocabulary into a 30x30 grid, where each word represents one image pixel. Each cluster has a strong probability of producing words from one horizontal or vertical stripe or "bar" on this grid. There are 10 true bar clusters. We generate documents $x_n$ directly from the finite mixture generative model, with roughly equal probability for each bar. Fig. 3.4 shows the results of extensive local optima experiments on the task of clustering this synthetic dataset. Like the earlier experiments, we compare the impact of the number of initial clusters and the chosen initialization procedure. Our conclusions are:

For multinomial likelihoods, local optima prevent rich-get-richer prior bias from encouraging cluster usage decay. In Fig. 3.3 for zero-mean Gaussian likelihoods, we saw that large K initializations have many clusters with usage dropping to zero. In contrast, here in Fig. 3.4 we see that K = 40 initializations with 4x the number of true clusters consistently have no cluster become zeroed out across many runs. Because the multinomial likelihood is very flexible, it can exactly match the mean of any set of member documents and thus discourage usage from dropping to zero.

Too many clusters incur substantial penalty in ELBO.
Results with K = 40 initial clusters doubtlessly include multiple copies of all 10 true bars. These runs achieve ELBO values which are roughly half-way between the best score of the ideal K = 10 clusters and the score from using only K = 5 clusters.

[Figure 3.4 panels: train ELBO vs. number of initial clusters K, plus number of effective clusters and train ELBO vs. training time (sec), for K in {10, 20, 40} under each initialization; best (1st of 10) and worst (10th of 10) runs shown per method.]

Figure 3.4: Local optima found when clustering toy bars data with DP mixture models. Dataset: 100 synthetic documents generated from 10 true bars topics via a true finite mixture model. We compare various initializations for the variational inference algorithm in Alg. 3.1 with the goal of exposing local optima problems. Left column: random initialization, which chooses K initial documents uniformly at random without replacement to initialize K clusters. Right column: Bregman k-means++ initialization, which chooses K initial documents as cluster centers via Alg. 2.2.
Thus, the DP mixture prior leads to harshly penalizing redundant and irrelevant active clusters, even if the update steps of the algorithm cannot escape such local optima.

Data-driven Bregman initialization yields slightly better results than random. Comparing the top row plots of Fig. 3.4 shows a noticeably reduced spread in ELBO performance for Bregman k-means++ initialization compared to random initialization across runs, no matter what initial number of clusters is used.

NIPS experiments

[Figure 3.5 panels: train ELBO vs. number of initial clusters K, plus number of effective clusters and train objective vs. training time (sec), for K in {100, 200, 400} under each initialization.]

Figure 3.5: Local optima found when clustering NIPS articles with DP mixture models. Dataset: 1392 articles published in the proceedings of the Neural Information Processing Systems (NIPS) academic conference. We compare various initializations for the variational inference algorithm in Alg. 3.1 with the goal of exposing local optima problems. Left column: random initialization, which chooses K initial documents uniformly at random without replacement to initialize K clusters. Right column: the slightly better Bregman k-means++ initialization, which chooses K initial documents as cluster centers via Alg. 2.2.

Fig.
3.5 shows the results of extensive local optima experiments on the task of clustering articles from the proceedings of the Neural Information Processing Systems (NIPS) conference. This is a popular benchmark dataset among topic modelers. Our conclusions are:

For this dataset, hundreds of clusters seem useful. Across both initialization methods, the best scores are achieved with roughly 300 or 400 active clusters at initialization.

For multinomial likelihoods, local optima prevent rich-get-richer prior bias from encouraging cluster usage decay. Looking at the number of effective topics plots in the second row of Fig. 3.5, we see that across 10 runs the number of topics assigned to at least one observation never decreases from its initial value. This confirms our earlier toy data experiment.

Data-driven Bregman++ initialization does not seem visibly better than random. Somewhat surprisingly, except when the number of clusters is very small (K = 50), the set of final ELBO values across multiple initializations does not seem to differ in spread or mean between the two methods.

3.7 Discussion

Our full-dataset variational algorithm from Alg. 3.1 and its two more scalable alternatives, stochastic (Alg. 3.2) and memoized (Alg. 3.3), represent promising training algorithms when given a reasonable initialization. However, our experiments here demonstrate that achieving good initializations can actually be surprisingly complicated, and that as long as the number of clusters K remains fixed these algorithms are sorely limited in their ability to escape from a poor initialization. In the next chapter, we show that additional proposal moves that add and remove clusters during training lead to greatly improved performance.

Chapter 4

Proposal moves to escape local optima for DP mixture models

Both standard full-dataset variational inference and scalable variants like stochastic (SVI) and memoized (MVI) can get stuck in poor local optima, as shown in the experiments of Sec.
3.6. This can occur in part because each local or global step only updates a limited subset of the variational parameters. To escape the current basin of attraction for better regions of the parameter space, we need proposal moves that can change all free parameters jointly, making bigger changes than the coordinate ascent steps of local and global updates. In this chapter, we describe several possible proposal moves specifically for the Dirichlet process (DP) mixture model. In later chapters, we will discuss how these moves might change for more complex models. We consider three possible proposal moves to escape local optima, some that remove clusters and others that add clusters. First, merge moves eliminate redundancy by combining two current clusters into a single combined cluster. Second, birth moves try to introduce new clusters in data-informed fashion to improve the explanation of the data. Finally, delete moves remove a single junk cluster by reassigning its mass across many remaining clusters. Each proposal move is executed by first transforming some original configuration of variational parameters into a candidate new configuration, and then deciding whether this new state improves the variational optimization objective L. The correctness of these moves follows simply from the fact that they are not accepted unless the objective L improves. Several previous efforts (Ueda et al., 2000; Bryant and Sudderth, 2012) employ similar proposal moves to escape local optima for specific Bayesian models within optimization-based algorithms. Ueda et al. developed split and merge moves for Bayesian finite mixture models for an expectation-maximization algorithm (Ueda et al., 2000) as well as a full variational algorithm (Ueda and Ghahramani, 2002). Beal and Ghahramani (1999) provided a variational algorithm for training mixtures of factor analyzers which could add new clusters, albeit without any formal guarantees about improving the objective function.
While interesting, these foundational efforts had several limitations: they examined finite mixture models rather than Bayesian nonparametric models, they did not pursue scalable algorithms at all, and they did not include anything like a delete proposal. In contrast, we make scalable construction and evaluation a priority, and also show that additional delete moves are required to escape some local optima. Among stochastic variational methods, Bryant and Sudderth (2012) developed split and merge moves for stochastic variational inference of topic models. While these moves are somewhat effective, we again find that our delete move is crucially more powerful than these approaches. Furthermore, as shown later in our study of topic models in Ch. 5, using proposals with SVI can yield pathological uncontrolled growth in the number of clusters due to the lack of rigorous guarantees about improving the objective function L. Our approach leads to much more reliable behavior. Finally, a thread of research has used split and merge proposals within MCMC sampling training algorithms. Jain and Neal (2004) were the first to deploy split and merge moves, while recent work has seen many specialized proposals for DP mixtures (Chang and Fisher III, 2013) and HDP topic models (Chang and Fisher III, 2014). The requirement of maintaining detailed balance within the Markov chain of posterior samples can often hinder the design of sampler-based split and merge moves for new models. Furthermore, these proposals are also difficult to scale to large datasets, though Chang and Fisher III (2014) have invested some laudable implementation effort in parallelization and other improvements.
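The accept-only-if-improved pattern shared by all three proposal moves described earlier in this chapter can be sketched generically. This is a schematic only; `make_candidate` and `elbo` are hypothetical stand-ins for a specific move's construction step and the objective evaluation.

```python
# Sketch of the generic proposal pattern: transform the current
# variational parameters into a candidate configuration, then accept
# only if the objective L strictly improves.
def try_proposal(state, make_candidate, elbo):
    candidate = make_candidate(state)      # e.g. a merge, birth, or delete
    if elbo(candidate) > elbo(state):
        return candidate, True             # accept: L strictly improved
    return state, False                    # reject: keep the original state
```

Correctness in the sense used above follows immediately: the returned state never has a lower objective value than the input state.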
To summarize, our goal is to develop proposal moves that can be constructed and evaluated using minimal computation in a batch-by-batch processing fashion, have a unified framework across several Bayesian nonparametric models, and preserve some guarantees about monotonic increase in the optimization objective L when possible. Preliminary versions of proposal moves for the DP mixture model were presented in our NIPS 2013 conference paper (Hughes and Sudderth, 2013). However, we have since made several improvements, such as formal guarantees of ELBO objective score improvement when a birth is accepted.

4.1 Merge moves for DP mixture models

The goal of the merge proposal is to create a candidate set of the variational free parameters $\hat{r}', \hat{\eta}', \hat{\tau}', \hat{\nu}'$ which modifies the current state $\hat{r}, \hat{\eta}, \hat{\tau}, \hat{\nu}$ by combining two chosen clusters $k_A$ and $k_B$ into a single cluster. As shown in Fig. 4.1, this can remove redundant clusters and lead to more compact models that are simpler to interpret and faster to train (because algorithm runtime scales linearly with K).

4.1.1 Merge proposal construction

Consider a current model with truncation level K. Let integers $k_A$ and $k_B$ be the labels of the pair of clusters we wish to try merging. Formally, we have $1 \leq k_A < k_B \leq K$. We will assemble the proposed set of variational free parameters $\hat{r}', \hat{\eta}', \hat{\tau}', \hat{\nu}'$ in two steps. First, we create $\hat{r}'$ via a deterministic merge step from the original responsibilities $\hat{r}$. This new set of clusters will only have K - 1 clusters.

[Figure 4.1 diagram: the original state $q(\phi | \hat{\tau}, \hat{\nu})$, $\hat{r}$, $q(\pi^G | \hat{u}, \hat{\omega})$ is transformed in two steps into the proposal $\hat{r}'$, $q(\phi | \hat{\tau}', \hat{\nu}')$, $q(\pi^G | \hat{u}', \hat{\omega}')$; e.g. $\hat{r}_n = [.01, .02, .10, .87]$ becomes $\hat{r}'_n = [.01, .02, .97]$.]

Figure 4.1: Illustration of merge proposal for DP mixtures. Top: Holistic view of merge proposal construction. We start with an existing model with 4 active clusters.
This original state is captured by the local assignment responsibilities $\hat{r}$, the cluster frequency parameters $\hat{u}, \hat{\omega}$, and the observation model parameters $\hat{\tau}, \hat{\nu}$. The first step is a deterministic construction of new assignment responsibilities $\hat{r}'$, where the values for the red and purple clusters are combined. The second step is construction of new global parameters $\hat{\tau}', \hat{\nu}', \hat{u}', \hat{\omega}'$ given the responsibilities $\hat{r}'$. The complete model has only 3 active clusters. Bottom detail: Example deterministic construction of merged responsibilities. Any responsibility mass originally on cluster labels $k_A$ (red) and $k_B$ (purple) is merged via addition into a single entry of the new responsibility vector.

Second, we use the closed-form global parameter updates to arrive at the corresponding global parameters. That is, we find the optimal allocation parameters $\hat{\eta}'$ via Eq. (3.43), and the optimal observation parameters $\hat{\tau}', \hat{\nu}'$ via Eq. (3.41).

Merge construction for local responsibilities

Our chosen rule for constructing the candidate $\hat{r}'$ from the existing responsibilities $\hat{r}$ is simply to take any posterior mass in $q(z)$ assigned to cluster $k_B$ and reassign this mass to the cluster $k_A$, while also retaining any existing mass on cluster $k_A$. This effectively combines these two clusters into a single cluster with label $k_A$. To deal with the removal of the cluster label $k_B$, we shift all larger cluster indices $k > k_B$ down by one. This leaves an effective set of new responsibilities $\hat{r}'$ with truncation level K - 1. For $k \in \{1, 2, \ldots, K-1\}$:

$$\hat{r}'_{nk} = \begin{cases} \hat{r}_{n k_A} + \hat{r}_{n k_B} & \text{if } k = k_A \\ \hat{r}_{nk} & \text{else if } k < k_B \\ \hat{r}_{n,k+1} & \text{else if } k \geq k_B \end{cases} \tag{4.1}$$

Merge construction of summary statistics for local responsibilities

Given proposed responsibilities $\hat{r}'$ for K - 1 clusters, we need to compute the summary statistics $N(\hat{r}')$, $S(\hat{r}', x)$ defined in Eq. (3.35).
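The responsibility merge of Eq. (4.1) amounts to adding one column into another and deleting it. A minimal numpy sketch (array names hypothetical):

```python
# Sketch of Eq. (4.1): column kB is added into column kA, then removed,
# which automatically shifts indices k > kB down by one.
import numpy as np

def merge_resp(resp, kA, kB):
    """resp: N x K responsibilities, with 0 <= kA < kB < K. Returns N x (K-1)."""
    out = resp.copy()
    out[:, kA] += out[:, kB]          # combine mass of clusters kA and kB
    return np.delete(out, kB, axis=1) # drop kB; larger labels shift down
```

In practice one would not materialize $\hat{r}'$ at all for large N; the needed statistics can be merged directly, as described next.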
These statistics play a fundamental role in later global parameter updates and ELBO objective evaluation for our proposed configuration. Because these statistics are linear functions of the responsibilities $\hat{r}'$, they need not be computed directly from the new responsibilities $\hat{r}'$ via Eq. (3.35), which would require $O(NK)$ operations. Instead, they can be directly computed from the existing summary statistics $S(\hat{r})$, $N(\hat{r})$ for each of the K original clusters. For $k \in \{1, 2, \ldots, K-1\}$:

$$N_k(\hat{r}') = \begin{cases} N_{k_A}(\hat{r}) + N_{k_B}(\hat{r}) & \text{if } k = k_A \\ N_k(\hat{r}) & \text{else if } k < k_B \\ N_{k+1}(\hat{r}) & \text{else if } k \geq k_B \end{cases} \tag{4.2}$$

$$S_k(\hat{r}', x) = \begin{cases} S_{k_A}(\hat{r}, x) + S_{k_B}(\hat{r}, x) & \text{if } k = k_A \\ S_k(\hat{r}, x) & \text{else if } k < k_B \\ S_{k+1}(\hat{r}, x) & \text{else if } k \geq k_B \end{cases}$$

This direct construction from existing summaries requires only one sum operation for each of $N(\hat{r}')$ and $S(\hat{r}', x)$, which is an $O(1)$ operation independent of the number of observations N. Thus, it is much more efficient than the $O(NK)$ cost of evaluating Eq. (3.35).

Merge construction of entropy statistics for local responsibilities

Unlike the statistics above, the entropy statistic $H_k(\hat{r}')$ defined in Eq. (3.29) is a non-linear function of $\hat{r}$. Thus, while some entries of the original entropy statistic vector $H(\hat{r})$ can be reused after appropriate index shifting, the entry corresponding to the newly-merged cluster $k_A$ must be computed anew from the corresponding column of $\hat{r}'$. For $k \in \{1, 2, \ldots, K-1\}$:

$$H_k(\hat{r}') = \begin{cases} -\sum_{n=1}^{N} \hat{r}'_{nk} \log \hat{r}'_{nk} & \text{if } k = k_A \\ H_k(\hat{r}) & \text{else if } k < k_B \\ H_{k+1}(\hat{r}) & \text{else if } k \geq k_B \end{cases} \tag{4.3}$$

Computing the exact value of $\mathcal{L}_{\text{entropy}}(\hat{r}')$ thus requires $O(N)$ work for each merge pair. We stress that this value is not needed to create a valid set of proposed free parameters $\hat{r}', \hat{\eta}', \hat{\tau}', \hat{\nu}'$.
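The contrast between the O(1) statistic merge of Eq. (4.2) and the O(N) entropy recomputation of Eq. (4.3) can be sketched in numpy (names hypothetical, illustration only):

```python
# Sketch of Eqs. (4.2)-(4.3): cached linear statistics merge in O(1)
# via index arithmetic, while the merged entropy entry needs an O(N)
# pass over the responsibilities.
import numpy as np

def merge_summaries(N_k, kA, kB):
    """O(1) merge of cached per-cluster counts (same pattern applies to S)."""
    out = N_k.copy()
    out[kA] += out[kB]
    return np.delete(out, kB)

def merged_entropy_entry(resp, kA, kB):
    """O(N) recomputation of H for the single newly-merged cluster kA."""
    r = resp[:, kA] + resp[:, kB]
    r = r[r > 0]                      # zero mass contributes no entropy
    return -np.sum(r * np.log(r))
```

Only the single entropy scalar touches the data; everything else is bookkeeping on cached summaries.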
However, it is needed to evaluate the entropy term $\mathcal{L}_{\text{entropy}}(\hat{r}')$ of the objective function $\mathcal{L}(\hat{r}', \hat{\eta}', \hat{\tau}', \hat{\nu}')$, which is used to decide if the proposed set of free parameters improves over the current set.

Merge construction of observation model global parameters

Given a concrete set of responsibilities $\hat{r}'$, or equivalently the summary statistics $N(\hat{r}')$ and $S(\hat{r}', x)$, we can obtain proposed values for the observation model global parameters $\hat{\tau}', \hat{\nu}'$ via the standard global update step in Eq. (3.41). This produces values $\hat{\tau}'_k, \hat{\nu}'_k$ for each cluster k in the set of K - 1 active clusters. By definition, this global step update solves the optimization problem of maximizing $\mathcal{L}_{\text{data}}(\hat{r}', \hat{\tau}', \hat{\nu}')$ given fixed $\hat{r}'$. For original cluster labels k not involved in the merge, that is $k \notin \{k_A, k_B\}$, the new parameters will equal the original parameters $\hat{\tau}_k, \hat{\nu}_k$, though perhaps reindexed due to the label-shifting operation required to remove the cluster $k_B$. Thus, savvy implementations need not recompute these values and can instead just reindex the original parameters.

Merge construction of DP-allocation-model global parameters

For the DP mixture model, we can find the optimal global allocation parameters $\hat{\eta}'$ given $\hat{r}'$ via the update in Eq. (3.43). Any cluster with index $k < k_A$ in the original model will be unchanged after this update. However, all clusters $k \in [k_A, k_B]$ will have new values after this merge update. This is due to the dependence of the updates on the effective count statistic $N^>$, which after a merge may be computed for $k \in \{1, 2, \ldots, K-1\}$ as:

$$N^>_k(\hat{r}') = \begin{cases} \sum_{\ell=k+1}^{K-1} N_\ell(\hat{r}') & \text{if } k_A \leq k < k_B \\ N^>_k(\hat{r}) & \text{if } k < k_A \\ N^>_{k+1}(\hat{r}) & \text{if } k \geq k_B \end{cases} \tag{4.4}$$

4.1.2 Evaluation of merge proposals

We represent the existing state of the DP mixture variational approximate posterior with K active clusters via the full set of free parameters $\hat{r}, \hat{\eta}, \hat{\tau}, \hat{\nu}$.
After a merge of $k_A$ and $k_B$, we have a fully-constructed candidate state with K - 1 active clusters: $\hat{r}', \hat{\eta}', \hat{\tau}', \hat{\nu}'$. We decide whether to accept or reject this proposal by comparing the ELBO objective scores for both states. If our merge candidate state improves the ELBO score, we accept it. Otherwise, we reject it. The acceptance score is defined as:

$$\mathcal{L}(x, \hat{r}', \hat{\eta}', \hat{\tau}', \hat{\nu}') - \mathcal{L}(x, \hat{r}, \hat{\eta}, \hat{\tau}, \hat{\nu}) = \big[\mathcal{L}_{\text{data}}(x, \hat{r}', \hat{\tau}', \hat{\nu}') - \mathcal{L}_{\text{data}}(x, \hat{r}, \hat{\tau}, \hat{\nu})\big] + \big[\mathcal{L}_{\text{entropy}}(\hat{r}') - \mathcal{L}_{\text{entropy}}(\hat{r})\big] + \big[\mathcal{L}_{\text{DP-alloc}}(\hat{r}', \hat{\eta}') - \mathcal{L}_{\text{DP-alloc}}(\hat{r}, \hat{\eta})\big] \tag{4.5}$$

Below, we discuss crucial simplifications for the $\mathcal{L}_{\text{data}}$ term, the $\mathcal{L}_{\text{entropy}}$ term, and the $\mathcal{L}_{\text{DP-alloc}}$ term that make rapid evaluation possible.

Simplification of data term. First, we evaluate the relative improvement in the $\mathcal{L}_{\text{data}}$ term. We assume that each state has constructed its global parameters optimally given the local responsibilities. That is, for each existing cluster k, we have $\hat{\tau}_k = S_k(\hat{r}, x) + \bar{\tau}$ and $\hat{\nu}_k = N_k(\hat{r}) + \bar{\nu}$ via Eq. (3.41). Likewise, $\hat{\tau}', \hat{\nu}'$ are functions of (statistics of) $\hat{r}'$. Then the slack-term simplifications of $\mathcal{L}_{\text{data}}$ kick in and we have

$$\mathcal{L}_{\text{data}}(x, \hat{r}', \hat{\tau}', \hat{\nu}') - \mathcal{L}_{\text{data}}(x, \hat{r}, \hat{\tau}, \hat{\nu}) = c^P(\hat{\tau}'_{k_A}, \hat{\nu}'_{k_A}) - c^P(\hat{\tau}_{k_A}, \hat{\nu}_{k_A}) - c^P(\hat{\tau}_{k_B}, \hat{\nu}_{k_B}) \tag{4.6}$$
$$= c^P(S_{k_A}(\hat{r}', x) + \bar{\tau}, N_{k_A}(\hat{r}') + \bar{\nu}) - c^P(S_{k_A}(\hat{r}, x) + \bar{\tau}, N_{k_A}(\hat{r}) + \bar{\nu}) - c^P(S_{k_B}(\hat{r}, x) + \bar{\tau}, N_{k_B}(\hat{r}) + \bar{\nu})$$

Thus, accepting a merge sensibly requires only terms related to clusters involved in the merge. Any other existing cluster $k \notin \{k_A, k_B\}$ has no impact on the acceptance decision.

Simplification of entropy term. Assuming that $\hat{r}'$ was constructed from $\hat{r}$ via Eq. (4.1), and consequently $H(\hat{r}')$ is defined as in Eq. (4.3), we have:

$$\mathcal{L}_{\text{entropy}}(\hat{r}') - \mathcal{L}_{\text{entropy}}(\hat{r}) = \sum_{k=1}^{K-1} H_k(\hat{r}') - \sum_{k=1}^{K} H_k(\hat{r}) = H_{k_A}(\hat{r}') - H_{k_A}(\hat{r}) - H_{k_B}(\hat{r}) \tag{4.7}$$

Again, this sensibly only involves terms related to the pair of clusters $k_A, k_B$ being merged.
This simplified difference-of-entropies will always be nonpositive for any merge pair, because by definition $H_{k_A}(\hat{r}') \leq H_{k_A}(\hat{r}) + H_{k_B}(\hat{r})$.

Simplification of DP-allocation term. Assuming that both the current parameters $\hat{\eta}$ and proposed parameters $\hat{\eta}'$ were computed by applying the update of Eq. (3.43) using the corresponding counts $N(\hat{r})$ and $N(\hat{r}')$, we can find similar simplifications for the allocation model objective, which is given by

$$\mathcal{L}_{\text{DP-alloc}}(\hat{r}', \hat{\eta}') - \mathcal{L}_{\text{DP-alloc}}(\hat{r}, \hat{\eta}) = \sum_{k=1}^{K-1} \big[ c^{\text{Beta}}(\hat{\eta}'_{k1}, \hat{\eta}'_{k0}) - c^{\text{Beta}}(1, \gamma) \big] - \sum_{k=1}^{K} \big[ c^{\text{Beta}}(\hat{\eta}_{k1}, \hat{\eta}_{k0}) - c^{\text{Beta}}(1, \gamma) \big] \tag{4.8}$$

After substituting in Eq. (4.4) and simplifying, we have:

$$\mathcal{L}_{\text{DP-alloc}}(\hat{r}', \hat{\eta}') - \mathcal{L}_{\text{DP-alloc}}(\hat{r}, \hat{\eta}) = c^{\text{Beta}}(1, \gamma) - c^{\text{Beta}}(\hat{\eta}_{k_B 1}, \hat{\eta}_{k_B 0}) + \sum_{k=k_A}^{k_B - 1} \big[ c^{\text{Beta}}(\hat{\eta}'_{k1}, \hat{\eta}'_{k0}) - c^{\text{Beta}}(\hat{\eta}_{k1}, \hat{\eta}_{k0}) \big] \tag{4.9}$$
$$= c^{\text{Beta}}(1, \gamma) - c^{\text{Beta}}(\hat{\eta}_{k_B 1}, \hat{\eta}_{k_B 0}) + c^{\text{Beta}}(\hat{\eta}_{k_A 1} + N_{k_B}, \hat{\eta}_{k_A 0} - N_{k_B}) - c^{\text{Beta}}(\hat{\eta}_{k_A 1}, \hat{\eta}_{k_A 0}) + \sum_{k=k_A+1}^{k_B - 1} \big[ c^{\text{Beta}}(\hat{\eta}_{k1}, \hat{\eta}_{k0} - N_{k_B}) - c^{\text{Beta}}(\hat{\eta}_{k1}, \hat{\eta}_{k0}) \big]$$

Here, this involves not only the clusters $k_A, k_B$ directly involved in the merge, but all clusters with indices between $k_A$ and $k_B$. This is due to the adjusted value of $N^>$ at those in-between indices in Eq. (4.4).

4.1.3 Scalable construction and evaluation via memoized statistics

For large datasets, the simplified decision rules above suggest that we need not directly instantiate the whole $N \times (K-1)$ responsibilities matrix $\hat{r}'$ to accept or reject a candidate merge of clusters $k_A$ and $k_B$. Instead, in the next pass through the full dataset we need only track the typical whole-dataset summary statistics $N^G, S^G, H^G$ of our memoized algorithm in Alg. 3.3, as well as the merged entropy scalar $H_{k_A}(\hat{r}')$. From these statistics alone, we can directly create all candidate global summaries $N^{G\prime}, S^{G\prime}, H^{G\prime}$ for the merged configuration. Thus, before a full pass through the dataset, we only need to decide on a pair $k_A, k_B$ to consider (see Sec. 4.1.4 for this decision process).
Then, we proceed with normal memoized updates at each batch, with the one additional requirement of computing H_{b,k_A}(r̂′) = −Σ_{n∈D_b} (r̂_{n,k_A} + r̂_{n,k_B}) log(r̂_{n,k_A} + r̂_{n,k_B}). That one extra scalar at each batch is all that is needed beyond the original fixed-truncation algorithm. After the full pass, we can then use the simplified acceptance rules from the previous section to decide whether to accept or reject the merge.

We emphasize that these simplified construction rules apply because both statistics N, S are linear functions of r̂. Given current assignment summaries N^G, S^G, we can always construct candidate summaries N′^G, S′^G and observation model parameters ν̂′, τ̂′ for any valid merge pair, and evaluate this candidate's objective score in Eq. (4.6) in time independent of the dataset size and without any additional local inference steps.

Accepting multiple merge proposals

Suppose that before the start of a lap we have a list of M candidate merge pairs: {(k_{mA}, k_{mB})}_{m=1}^{M}, where for each pair indexed by m we have k_{mA} < k_{mB}. Then, during a pass through each batch b of a dataset, we compute the merge entropy H_{bm}(r̂′) and aggregate these to a global quantity H^G_m(r̂′) for each pair of clusters m. After visiting all batches, we can then sequentially evaluate each of the M possible merge pairs one-by-one, accepting or rejecting each according to whether it improves the objective L, and updating the "existing" set of whole-dataset sufficient statistics N^G, S^G, H^G after every acceptance.

The above greedy evaluation procedure allows us to accept many merge proposals after a single pass through the dataset, so long as each accepted pair involves a distinct set of clusters. That is, if the first pair (m = 1) with cluster indices (1, 2) was accepted already, then the pair (1, 3) appearing later in the list of possible candidates must be rejected without evaluation.
This is because its entropy term H_m(r̂′) was computed using an old value of the responsibilities for cluster 1 that does not include the recently accepted merge of clusters 1 and 2. Again, it is only the non-linear entropy term that creates this complication. Overall, we are free to consider many candidate pairs during a single pass of the data, but each original cluster k can appear in at most one of the (shorter) list of accepted pairs. This still allows up to K/2 acceptances in every pass through the dataset.

4.1.4 Selecting a pair of clusters to try merging

Even with the simplified evaluation terms above, computing the entropy terms for all K(K−1)/2 possible merge pairs would have runtime cost of O(N K²), which is prohibitively expensive. Instead, we recommend a selection step at the beginning of each lap through the dataset which identifies at most M candidate pairs to consider, where M is specified by the user as a maximum budget.

To rank possible candidates, we suggest using a score function based on the subset of the simplified acceptance criteria:

    m-score(k_A, k_B) = L_data(x, r̂′, τ̂′, ν̂′) − L_data(x, r̂, τ̂, ν̂)                                                      (4.10)
                        + L_DP-alloc(r̂′, η̂′) − L_DP-alloc(r̂, η̂)

where we create the candidate local parameters r̂′ by merging clusters k_A, k_B, and we compute these terms via their simplified forms in Eq. (4.6) and Eq. (4.8). These terms can be evaluated exactly from the existing global summaries N^G, S^G and parameters τ̂, ν̂, η̂. Computing the difference of L_DP-alloc terms has cost which is O(K) at worst, when k_A = 1 and k_B = K. Computing the difference of L_data terms always has cost independent of K and N.

For the DP mixture model, the value of the m-score for a given pair k_A, k_B is a strict upper bound of the actual difference of L values used to accept or reject the merge pair. Any pair with m-score below zero will certainly be rejected once the difference-of-L_entropy terms are accounted for, because this term is always negative.
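This screening rule can be sketched directly: score every pair, discard certain rejections, and keep at most the M best. A minimal sketch, assuming `m_score` stands in for the exact computation of Eq. (4.10) (it is a placeholder callable here, not a thesis routine):

```python
import itertools

def select_merge_candidates(K, m_score, M):
    """Rank all K(K-1)/2 pairs by m-score and keep at most the M best with
    strictly positive score. Since the omitted entropy difference is always
    negative, any pair whose m-score (an upper bound on the true objective
    change) is <= 0 is a certain rejection and is screened out here."""
    scored = [(m_score(kA, kB), kA, kB)
              for kA, kB in itertools.combinations(range(K), 2)]
    scored = [t for t in scored if t[0] > 0]   # certain rejections dropped
    scored.sort(reverse=True)                  # best-scoring pairs first
    return [(kA, kB) for _, kA, kB in scored[:M]]
```

The O(K²) loop here touches only cached global summaries, never the data, which is what makes the selection step affordable.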
Thus, at the end of the selection step we retain only the set of candidate cluster pairs k_A, k_B with positive scores. This is the set of candidate merge pairs we track during all local steps in the current lap.

4.2 Birth moves

The goal of a birth proposal is to create a candidate set of variational parameters r̂′, τ̂′, ν̂′, η̂′ which modifies the current state r̂, τ̂, ν̂, η̂ by splitting an existing cluster j into two or more new clusters.

Figure 4.2: Illustration of birth proposal for DP mixtures. Holistic view of birth proposal construction. We start with an existing model with 3 active clusters. This original state is captured by the local assignment responsibilities r̂, the cluster frequency parameters û, ω̂, and the observation model parameters τ̂, ν̂. We select the orange cluster to target with a birth proposal. The first step is the construction of new assignment responsibilities r̂′ where any mass previously assigned to orange is now distributed across several new clusters, shown here in blue and green. The second step is construction of new global parameters τ̂′, ν̂′, û′, ω̂′ given the responsibilities r̂′. The complete proposal now has 5 active clusters, including the original, now-empty orange cluster. This could be discarded immediately, but we show it here for later connections to multi-batch proposals.

Figure 4.3: Illustration of local responsibility construction under birth proposal. Details of Step 1 from Fig. 4.2. Step 1A: Given the original responsibilities and a chosen cluster (orange) to target, we first create a subsampled dataset.
Step 1B: We use the Bregman k-means algorithm from Alg. 2.1, with distance-biased initialization from Alg. 2.2, to create point estimates for the global parameters of each new cluster, represented by φ̃ and π̃. Variables denoted with a tilde have dimension corresponding to the number of newborn clusters. Step 1C: We perform a restricted local step to obtain proposed responsibilities r̂′. There is one r̂′_n vector for each data atom n in the original dataset.

That is, we take a model with active truncation level K, and produce a model with truncation level K + J′, where J′ ≥ 2 is the total number of new clusters created by the birth proposal.

Like the merge move, a birth move proceeds in three steps. First, we create candidate responsibilities r̂′ from the existing set r̂. This leads immediately to the corresponding summary statistics N(r̂′), S(r̂′), H(r̂′). Second, we create the associated global parameters τ̂′, ν̂′, η̂′ via the global update step, so they achieve optimal values under the proposed responsibilities r̂′. Finally, we decide to accept or reject the proposed approximate posterior, based on whether it improves the overall objective. We first discuss these steps in detail for the simpler case of a single-batch dataset. We later discuss handling proposals for large-scale, multi-batch memoized evaluation.

4.2.1 Birth proposal construction

To create a birth proposal under our DP mixture model, we require as input a chosen active cluster index k_target satisfying 1 ≤ k_target ≤ K. Index k_target specifies our target cluster, which we will propose breaking up into two or more subclusters. In this section, we discuss how to create the candidate parameters given this choice. Selection of k_target is described later in Sec. 4.2.4.

Given the chosen cluster k_target, we need to create local responsibilities r̂′ and global parameters. We first construct the responsibilities r̂′ for the dataset under this proposal.
This construction is done in two stages: we first create temporary global parameters for the newborn clusters, and then we use these temporary parameters to construct r̂′. Later, we create the proposed values η̂′, τ̂′, ν̂′ of the global parameters via a global optimization step given r̂′.

Birth construction of initial global clusters

Given the whole dataset x and current responsibilities r̂, we identify a targeted subset x′ which is primarily assigned to cluster k_target by selecting those atoms with responsibility values over a threshold. That is, x′ = {x_n : r̂_{n,k_target} > ε}, where ε ≈ 0.1 in practical implementations.

Using this subset, we can quickly identify a set of J′ initial cluster shape parameters by running the InitClusterMeansViaBregmanSamples algorithm from Alg. 2.2. This yields a set of mean parameters {μ̃_j}_{j=1}^{J′}. Further iterations of the Bregman k-means algorithm in Alg. 2.1 from this initialization produce a set of point estimates μ̃, π̃ which are likely to explain the target dataset x′ well. The cost of each iteration of this creation step is linear in the size of the target dataset, not the whole dataset. The cost is also linear in the size of the new cluster label space J = {1, 2, . . . , J′}, which could be much smaller than the original label space of all K active clusters.

Using these point estimates, we have defined a finite mixture model over the newborn clusters in J which has a point estimate for the cluster frequency parameter π̃ (a vector of size J′ that sums to one) and a cluster mean parameter μ̃_j for each newborn cluster.

We can further combine this model over the newborn clusters with our current model over the existing clusters to create a candidate expanded model with a point-estimated frequency probability π_k for each cluster k ∈ {1, 2, . . . , K + J′}.
    π_k = E_q[π_k]                          if k ≤ K, k ≠ k_target                                                        (4.11)
    π_k = 0                                 if k = k_target
    π_k = E_q[π_{k_target}] π̃_{(k−K)}       if k > K

Birth construction of local responsibilities

We now need to use these new clusters to form responsibility parameters r̂′ for all clusters (including the original K clusters and the J′ new clusters) which are consistent with the whole dataset. As illustrated in Fig. 4.2, we will do this by conceptually reassigning probability mass from the original target cluster k_target to the new clusters at indices {K+1, K+2, . . . , K+J′}, while leaving all mass at other non-target clusters alone. This mass transfer operation for creating each atom's responsibility vector r̂′_n from r̂_n obeys the original constraints on this vector: it must be non-negative and sum to one.

Under these constraints, we can effectively reparameterize the full vector r̂′_n as a deterministic function of the original vector r̂_n and a smaller vector p̃_n = [p̃_{n1} . . . p̃_{nJ′}] of length J′. We interpret each entry p̃_{nj} as the fraction of the original mass on cluster k_target that is transferred to new cluster j. This mass transfer vector is non-negative and sums to one. Using this reparameterization, we have, for k ∈ {1, 2, . . . , K, K+1, . . . , K+J′}:

    r̂′_{nk} = r̂_{nk}                            if k ≤ K, k ≠ k_target                                                    (4.12)
    r̂′_{nk} = 0                                 if k = k_target
    r̂′_{nk} = r̂_{n,k_target} p̃_{n,(k−K)}        if k > K

This construction by definition satisfies the desired sum-to-one constraints on r̂′_n.

Restricted local step. The creation step above provides concrete values for the initial global parameter estimates {τ̃_j, ν̃_j, π̃_j} for the set of new cluster labels J. Given these, we'd like to find optimal responsibilities r̂′_n for each data atom n under our local assignment objective from Eq. (3.46).
As a function of r̂′_n, the objective is:

    L_n(x_n, r̂′_n) = Σ_{k=1}^{K+J′} [ r̂′_{nk} W_{nk} − r̂′_{nk} log r̂′_{nk} ]                                              (4.13)

which we can decompose into a sum over the original clusters and a sum over the newborn clusters in label space J:

    = Σ_{k=1}^{K} [ r̂_{nk} W_{nk} − r̂_{nk} log r̂_{nk} ]                                                                  (4.14)
      + Σ_{j=1}^{J′} [ r̂_{n,k_target} p̃_{nj} W̃_{nj} − r̂_{n,k_target} p̃_{nj} log(r̂_{n,k_target} p̃_{nj}) ]

where the posterior weights for new clusters are given by:

    W̃_{nj}(x_n, τ̃, ν̃, π̃) = E_{q(φ_j|ν̃_j,τ̃_j)}[log p(x_n|φ_j)] + log π̃_j                                                  (4.15)

which as a function of the reassignment probabilities p̃_n (dropping terms constant with respect to p̃_n) simplifies to

    L_n(x_n, r̂_n, p̃_n) = r̂_{n,k_target} Σ_{j=1}^{J′} [ p̃_{nj} W̃_{nj} − p̃_{nj} log p̃_{nj} ]                                (4.16)

Given the posterior weights W̃ for the new clusters, we can find the optimal reassignment probabilities by using p̃*_n = RespFromWeights(W̃_n). This delivers a vector of size J′ that sums to one, as required by our constraints. Thus, we may compute the optimal reassignment probabilities via a version of the local update step from Eq. (3.48) which is restricted to the new label space J.

This restricted local step will in general be much more affordable than a full local step over all clusters. The runtime cost of the restricted local step scales linearly with the number of new states J′ and remains independent of the total number of states K. This keeps proposal construction manageable even for large numbers of clusters. The cost is, however, linear in the total number of atoms N, which may be slow. To make costs more manageable, we can focus on the subset of atoms with significant mass on the targeted cluster k_target: D′ = {n : r̂_{n,k_target} ≥ ε}. We may perform restricted local steps only on these atoms, and for any others use a deterministic rule where p̃_n is set to an indicator vector for the newborn cluster with maximum weight, or the closest cluster in the newborn set.
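Steps 1A–1C above can be sketched end-to-end in numpy. This is a minimal illustrative sketch, not the thesis code: it assumes a spherical Gaussian observation model with unit variance and a uniform π̃, so the posterior weights W̃ reduce to negative half squared distances, and `means_new` stands in for the output of the Bregman k-means initialization.

```python
import numpy as np

def resp_from_weights(W):
    """Softmax log-weights row-wise into non-negative, sum-to-one vectors;
    a plain stand-in for the RespFromWeights routine."""
    W = W - W.max(axis=1, keepdims=True)  # stabilize before exponentiating
    P = np.exp(W)
    return P / P.sum(axis=1, keepdims=True)

def birth_step1(X, resp, k_target, means_new, eps=0.1):
    """Restrict to atoms with mass > eps on the target (Step 1A), score them
    under point estimates for the J' newborn clusters (Step 1C), and
    transfer the target's mass per Eq. (4.12)."""
    N, K = resp.shape
    J = len(means_new)
    p_tilde = np.zeros((N, J))
    active = resp[:, k_target] > eps
    d = ((X[active, None, :] - means_new[None, :, :]) ** 2).sum(axis=2)
    p_tilde[active] = resp_from_weights(-0.5 * d)
    if (~active).any():
        # deterministic rule for the rest: closest newborn cluster
        d0 = ((X[~active, None, :] - means_new[None, :, :]) ** 2).sum(axis=2)
        p_tilde[~active] = np.eye(J)[d0.argmin(axis=1)]
    new_mass = resp[:, [k_target]] * p_tilde   # Eq. (4.12), k > K entries
    resp_new = np.hstack([resp, new_mass])
    resp_new[:, k_target] = 0.0
    return resp_new
```

Note that every output row still sums to one by construction, since p̃_n itself sums to one.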
Birth construction of sufficient statistics

After computing the reassignment vector p̃_n for each data atom, we can immediately compute the effective count statistic N′_k at each cluster index in the proposed model k ∈ {1, 2, . . . , K+J′}:

    N′_k(r̂) = N_k(r̂)                                            if k ≤ K, k ≠ k_target                                   (4.17)
    N′_k(r̂) = 0                                                 if k = k_target
    N′_k(r̂) = Σ_{n=1}^{N} r̂_{n,k_target} p̃_{n(k−K)}              if k > K

Similarly, we compute the cluster shape sufficient statistic S′_k at each cluster index in the proposed model k ∈ {1, 2, . . . , K+J′}:

    S′_k(r̂, x) = S_k(r̂, x)                                       if k ≤ K, k ≠ k_target                                   (4.18)
    S′_k(r̂, x) = 0                                               if k = k_target
    S′_k(r̂, x) = Σ_{n=1}^{N} r̂_{n,k_target} p̃_{n(k−K)} s(x_n)    if k > K

Finally, for the entropy statistic at each cluster index k ∈ {1, 2, . . . , K+J′}:

    H′_k(r̂) = H_k(r̂)                                                                     if k ≤ K, k ≠ k_target          (4.19)
    H′_k(r̂) = 0                                                                          if k = k_target
    H′_k(r̂) = −Σ_{n=1}^{N} r̂_{n,k_target} p̃_{n(k−K)} log[r̂_{n,k_target} p̃_{n(k−K)}]      if k > K

Birth construction of global parameters

After the above restricted local step, we have a candidate set of responsibilities r̂′ for the expanded model with K + J′ clusters, as well as the appropriate sufficient statistics N^G, S^G, H^G summarizing these assignments. The next step is to obtain corresponding global free parameters for our DP mixture model. That is, we need observation model parameters {τ̂_k, ν̂_k}_{k=1}^{K+J′} for each of the K original clusters and J′ new clusters. We also need allocation parameters η̂_k for this expanded set of clusters.

Birth proposal global step for observation model. Eq. (3.41) gives a closed-form expression for the parameters τ̂′_k, ν̂′_k given the summary statistics N^G_k(r̂′), S^G_k(r̂′). We need to do this explicitly only for the newborn clusters at indices K+1, . . . , K+J′. Other indices will have the same global parameters as before, so for these values we can simply copy the original state: for k ≤ K, k ≠ k_target, we have ν̂′_k = ν̂_k and τ̂′_k = τ̂_k.
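The k > K branches of Eqs. (4.17)–(4.19) can be sketched compactly. A minimal numpy sketch, assuming s(x_n) = x_n for illustration (the identity sufficient statistic); names are illustrative, not from the thesis code:

```python
import numpy as np

def newborn_suff_stats(resp_target, p_tilde, X):
    """Statistics for the J' newborn clusters only; existing clusters keep
    their cached values. resp_target holds r-hat_{n,ktarget} for all n."""
    w = resp_target[:, None] * p_tilde      # N x J' reassigned mass
    N_new = w.sum(axis=0)                   # counts, Eq. (4.17)
    S_new = w.T @ X                         # shape stats, Eq. (4.18)
    safe = np.where(w > 0, w, 1.0)          # avoid log(0); 0*log(0) := 0
    H_new = -(w * np.log(safe)).sum(axis=0) # entropy, Eq. (4.19)
    return N_new, S_new, H_new
```

Because these sums run only over atoms with mass on the target cluster and only over the J′ newborn indices, the cost matches the restricted local step rather than a full summary pass.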
Finally, our proposal assigns zero mass to the target cluster, so its optimal global parameters will match the prior hyperparameters: ν̂′_{k_target} = ν̄ and τ̂′_{k_target} = τ̄.

Birth proposal global step for allocation model. Eq. (3.43) gives a closed-form update for the parameters η̂′ given fixed r̂′. Like the merge proposal, this construction will change not only the target cluster and the newborn clusters, but any non-target cluster in the set {k_target+1, k_target+2, . . . , K}. This is due to inserting the newborn cluster indices after the original set of cluster labels in stick-breaking order.

4.2.2 Birth proposal evaluation

As with merge proposals, we accept a candidate state r̂′, η̂′, τ̂′, ν̂′ using the decision score in Eq. (4.5). Several terms, like the differences of L_data terms and L_DP-alloc terms, can be simplified using similar analysis as in the derivation of the corresponding merge terms. The L_data term simplifies to:

    L_data(r̂′, τ̂′, ν̂′) − L_data(r̂, τ̂, ν̂) = Σ_{k=1}^{K+J′} [ c_P(τ̂′_k, ν̂′_k) − c_P(τ̄, ν̄) ]                                (4.20)
                                            − Σ_{k=1}^{K} [ c_P(τ̂_k, ν̂_k) − c_P(τ̄, ν̄) ]
                                          = c_P(τ̄, ν̄) − c_P(τ̂_{k_target}, ν̂_{k_target})
                                            + Σ_{j=1}^{J′} [ c_P(τ̂′_{K+j}, ν̂′_{K+j}) − c_P(τ̄, ν̄) ]

This can be interpreted as a sum of log ratios of cluster-specific marginal likelihoods, which is sensible given that our overall objective L is interpreted as a lower bound of the marginal likelihood.

4.2.3 Construction and evaluation via memoized statistics

When the full dataset is divided into many batches B > 1, we do not want to wait until visiting all batches before accepting a potential birth proposal. Accepting a proposal as soon as possible, even after trying it on only one of the B batches, will reduce extra computation and propagate useful changes quickly. However, this introduces some complications in the memoized tracking of batch-specific summaries.
After an accepted birth proposal at batch b that adds J′ new clusters, the new batch-specific summaries N_b, S_b, H_b will have larger truncation level K + J′, while other batches remain with only K active clusters. We need to be sure we can track whole-dataset summaries that exactly summarize all visited batches even if they have different truncation levels.

For the DP mixture model, our nested truncation means that at every batch b we can always insert empty clusters to the assignment statistics without changing the optimization objective function score L. Thus, after a birth of J′ new clusters at some batch is accepted, we can effectively update every other batch b by inserting J′ zeros to each summary vector. For example, for the new count statistic N′_b we have:

    N′_b(r̂) = [N_{b1}(r̂)  N_{b2}(r̂)  . . .  N_{bK}(r̂)  0  0  . . .  0]                                                    (4.21)

Inserting zeros allows on-demand expansion of any batch-specific summary vector from original truncation K to any desired larger truncation K + J′ where only the first K clusters are assigned to data. Using these zero-expanded batch-specific statistics, we can then aggregate whole-dataset statistics N^G, S^G which accurately represent the entire dataset's assignment to the active set of K + J′ clusters.

After a birth at batch b′ only, we can create the new candidate value N′^G for the whole dataset from the original values N^G via:

    N′^G_k = N^G_k − N_{b′,k_target}      if k = k_target                                                                 (4.22)
    N′^G_k = N^G_k                        else if k ≤ K
    N′^G_k = N′_{b′,k}                    else if k > K

Similar incremental updates can be used for whole-dataset cluster shape summaries S′^G and entropies H′^G. Importantly, we can compute the whole-dataset statistics without requiring an expansion at every batch first. We can thus perform expansion at other batches lazily after an accepted birth, waiting until the next visit to each batch b before applying Eq. (4.21) to its cached values of N_b, S_b, H_b.
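The zero-padding of Eq. (4.21) and the incremental count update of Eq. (4.22) can be sketched as follows. A minimal numpy sketch with illustrative names; the analogous updates for S and H follow the same pattern:

```python
import numpy as np

def pad_batch_summary(N_b, J_new):
    """Lazy expansion of a cached batch summary from truncation K to
    K + J' by appending zeros (Eq. 4.21); the objective is unchanged."""
    return np.concatenate([N_b, np.zeros(J_new)])

def counts_after_birth(N_G, N_b_old, N_b_new, k_target, K):
    """Whole-dataset counts after a birth accepted at batch b' (Eq. 4.22):
    subtract that batch's former mass on the target cluster and append
    the newborn-cluster counts from its updated summary N_b_new."""
    out = np.concatenate([N_G, N_b_new[K:]])
    out[k_target] = N_G[k_target] - N_b_old[k_target]
    return out
```

No other batch needs to be touched at acceptance time; each is padded on its next visit.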
4.2.4 Selecting clusters to target with birth proposals

We could attempt a birth at every cluster, but this would be expensive. To save time, we often wish to prioritize some clusters before others, and perhaps limit ourselves to a fixed budget on the number of births we can attempt at each batch. When visiting each batch, we need to determine which cluster k ∈ {1, 2, . . . , K} to target with a birth move, if any. We suggest an approach that accounts for two distinct cues: (1) how well each cluster represents or fits its assigned data, and (2) how well each cluster has performed in past birth moves.

To quantify the model-fit criterion, we measure the following score function at each cluster k:

    b-score(k) = (1/N_k) L_data(N, S, ν̂, τ̂, k)                                                                           (4.23)
               = (1/N_k) Σ_{n=1}^{N} r̂_{nk} E[log p(x_n|φ_k)]                                                             (4.24)

Intuitively, this score is an average log probability. Large-magnitude negative values indicate clusters whose assigned data are poorly predicted on average. At every batch, we greedily choose to target the clusters with the lowest measured b-score values.

In addition to modeling quality, we also track for each distinct cluster how often we have tried a birth in the past, and how often these tries have succeeded. We always prioritize births at clusters that have never failed before, and only try a cluster that has failed before if (a) the cluster has changed considerably since its last failure, (b) the current batch has more atoms assigned to the target than at our last birth attempt, or (c) all clusters have at least one failure, so we must choose a last-resort option.

Finally, when choosing a cluster to target for birth moves, we also account for several practical constraints. For example, we always skip any cluster that does not have adequate size (as measured by the summary N_{bk}) in the current batch, because we cannot learn a refined clustering from only a few data atoms. Typically, this threshold is very small, on the order of 3 to 50.
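The b-score ranking above can be sketched directly. A minimal numpy sketch, assuming `Elogp_sum` holds hypothetical precomputed per-cluster sums Σ_n r̂_{nk} E[log p(x_n|φ_k)]; the names are illustrative, not from the thesis code:

```python
import numpy as np

def pick_birth_targets(Elogp_sum, N_k, min_size=3.0, budget=1):
    """Rank clusters by b-score = (1/N_k) sum_n r_nk E[log p(x_n|phi_k)]
    (Eq. 4.24), worst-fit (most negative) first. Clusters smaller than
    min_size in the current batch are skipped entirely."""
    score = np.where(N_k >= min_size,
                     Elogp_sum / np.maximum(N_k, 1e-12), np.inf)
    order = np.argsort(score)   # most negative average log prob first
    return [int(k) for k in order[:budget] if np.isfinite(score[k])]
```

In practice this ranking would be combined with the failure-history bookkeeping described above before committing to a target.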
Furthermore, we always skip any cluster that is currently a target for merge or delete moves. Only one type of proposal move can be accepted at each cluster in a given lap.

Trying multiple births at each batch. Under our proposed construction of candidate states, a birth targeting cluster k never impacts the assignments of a non-involved original cluster j ≠ k. We may thus consider creating and evaluating multiple birth moves at once, so long as each targets a distinct cluster.

4.3 Delete moves

The goal of a delete proposal move is to create a candidate set of variational parameters r̂′, τ̂′, ν̂′, η̂′ with one fewer cluster than the current state. Merge moves can remove a cluster only by reassigning its mass to one other existing state. Deletes are designed to be more flexible, because they can redistribute the target cluster's assignments across many other clusters. However, this flexibility comes with an increased runtime cost, since it is more expensive to infer how to redistribute among multiple clusters than to use a deterministic formula to reassign responsibility mass.

A delete proposal begins by selecting a single cluster index, denoted k_target, to target for deletion. Without loss of generality, we will assume that the target cluster is last in stick-breaking order: k_target = K. This allows the indices of remaining clusters to be the same in the current and proposed states.

Figure 4.4: Illustration of delete proposal for DP mixtures. Holistic view of delete proposal construction of candidate state. We start with an existing model with 6 active clusters in the original state, which is defined by the local assignment responsibilities r̂, the cluster frequency parameters û, ω̂, and the observation model parameters τ̂, ν̂.
We select the orange cluster to target with a delete proposal, as well as several overlapping clusters as the designated absorbing set. The first step is the construction of new assignment responsibilities r̂′ where any mass previously assigned to orange is reassigned among the absorbing set, adding to any preexisting mass already on these clusters. The second step is construction of new global parameters τ̂′, ν̂′, û′, ω̂′ given the responsibilities r̂′. The complete candidate state has only K = 4 active clusters, having deleted the original orange cluster.

Next, we select a subset of other clusters A ⊂ {1, 2, . . . , K} that will absorb the mass from the target cluster. If we select only one absorbing cluster, the resulting delete proposal will be equivalent to a merge proposal. Selecting more than one absorbing cluster leads to potentially larger changes. Once the target cluster and its absorbing set are chosen, the move proceeds through the standard steps of creating a proposed state and then deciding whether to accept or reject it.

4.3.1 Delete proposal construction of local responsibilities

A delete proposal consists of new local assignments r̂′_n for each data atom in the entire dataset. Within the proposed assignment vector r̂′_n, all mass formerly on the target cluster is reassigned among the absorbing set A. As usual, the candidate responsibility vector r̂′_n for data atom n must obey the non-negativity and sum-to-one constraints of any categorical distribution parameters. We further assume that all clusters not involved in the absorbing set remain at their previous values in r̂. Thus, like birth moves we can parameterize the delete in terms of a reassignment probability vector p̃_n over the |A| entries of the absorbing set. For k ∈ {1, 2, . . . , K−1}:

    r̂′_{nk} = r̂_{nk}                                                 if k ∉ A                                             (4.25)
    r̂′_{nk} = (r̂_{n,k_target} + Σ_{ℓ∈A} r̂_{nℓ}) p̃_{n,j}               if k = A_j

Again, the vector p̃_n has |A| non-negative entries and sums to one.
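A minimal numpy sketch of the reassignment rule in Eq. (4.25), with illustrative names (not the thesis code): pool each atom's mass on the target and the absorbing set, redistribute it via p̃_n, and drop the target column.

```python
import numpy as np

def delete_resp(resp, k_target, absorb, p_tilde):
    """Per-atom delete reassignment (Eq. 4.25). `absorb` lists the indices
    of the absorbing set A; p_tilde is N x |A|, each row non-negative and
    summing to one. Non-involved clusters are untouched."""
    pooled = resp[:, k_target] + resp[:, absorb].sum(axis=1)
    out = resp.copy()
    out[:, absorb] = pooled[:, None] * p_tilde
    return np.delete(out, k_target, axis=1)  # target column removed
```

Each output row still sums to one, because the pooled mass is redistributed by a vector that itself sums to one.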
The construction of r̂′_n above is completely determined by the original vector r̂_n and the reassignment probabilities p̃_n.

Restricted local step for p̃_n. Like the birth move, our delete move proceeds by estimating the best reassignment probabilities p̃_n for each data atom n via a restricted local step. We first imagine creating a finite mixture model for the restricted subset of |A| absorbing clusters from our current DP mixture model. This limited model has point estimates π̃ for each of the absorbing clusters:

    π̃_j = E_q[π_k],  k = A_j                                                                                             (4.26)

as well as observation model parameters {τ̃_j, ν̃_j} copied directly from the corresponding absorbing cluster indices within the complete original set {τ̂_k, ν̂_k}_{k=1}^{K}. Given these global parameters, we can compute posterior weights for each cluster in the absorbing set, for j ∈ {1, 2, . . . , |A|}:

    W̃_{nj} = log π̃_j + E_q[log p(x_n|φ_j)]                                                                               (4.27)

After computing the weights, we set reassignment probabilities via p̃_n ← RespFromWeights(W̃_n). This gives a procedure for constructing r̂′ given the current assignments r̂ and the global parameters η̂, τ̂, ν̂ which has runtime that scales with the size of the absorbing set, not the total number of clusters.

4.3.2 Delete proposal construction of global parameters

A delete proposal's local construction delivers proposed assignments r̂′ for the K−1 remaining clusters, as well as corresponding sufficient statistics. We can construct the global parameters via the standard global optimization steps.

4.3.3 Scalable construction and evaluation with memoized statistics

For large datasets with B > 1 batches processed one batch at a time via our memoized algorithm, we can still successfully construct and evaluate delete proposals whose candidates are consistent representations of the entire dataset. The basic template is much like the merge move, where at the beginning of each pass through the dataset we identify the target cluster and the absorbing set.
In the first batch, we construct the proposed local parameters r̂′_b and save the corresponding statistics Ñ_b, S̃_b, H̃_b for all involved clusters. Each of these tracked values scales with the number of absorbing clusters |A|, which could be much less than the total number of active clusters K. At each subsequent batch, we perform similar tracking and aggregate the proposed statistics. Finally, after seeing all batches, we can construct the relevant whole-dataset statistics N′^G, S′^G, H′^G, which represent in aggregate the proposed assignments r̂′ across all batches. These summary statistics are fed into the standard global step to obtain candidate global parameters, and we can evaluate the resulting candidate.

Extensions to multiple delete proposals. There are two ways to think about performing delete proposals for multiple clusters. First, we could consider performing two separate delete proposals for two candidate clusters j and k. Each proposal would be constructed and evaluated independently. As with merge and birth proposals, we cannot simultaneously accept proposals which involve the same clusters, either as the target or in the absorbing set. However, we can consider multiple such proposals and only accept the one which is best. If the total involved sets of clusters in two or more moves are disjoint, we can accept both.

The second way to think about multiple proposals is to imagine removing two or more clusters at the same time and redistributing their mass among the remaining clusters. This is certainly possible, but risks a combinatorial explosion in the number of possible pairs or triples of target clusters paired with possible absorbing sets.

4.3.4 Selecting the target cluster to delete

Two key choices are required in the specification of a delete move: the targeted cluster k_target and the absorbing set of clusters A ⊂ {1, 2, . . . , K}. We first discuss the choice of k_target, and later the choice of A.
We could systematically try all possible values of k_target in each lap, but this may be expensive since we can accept only one option at a time. Instead, we recommend first tracking which clusters have failed attempts at deletion in the past. After a cluster has failed more than a prescribed number of proposal attempts, it should not be tried again. Among the remaining clusters, we might use a size bias (preferring to try to delete small but non-zero clusters) or choose them at random. As a practical concern, we avoid trying to simultaneously target a cluster for birth, merge, and deletion.

If speed is not a concern, the absorbing set may be chosen simply as all clusters except k_target. This will provide the greatest chance of acceptance, but has cost that scales as O(K−1). When K is in the hundreds or thousands, tracking such a large absorbing set may be problematic. Instead, we can more cleverly identify a much smaller set of 2 to 10 "nearest neighbors" to the chosen target cluster k_target. One option would be to choose the closest clusters in the sense of Bregman divergence from the current target's mean parameters. Another option might be to examine the pairwise co-occurrence within the responsibilities r̂ and choose potential absorbing clusters this way.

4.4 Experimental results

4.4.1 Toy example where deletes outperform merges

To illustrate the practical benefits of our delete move, we consider the task of training a DP mixture model on a synthetic dataset with N = 25,000 observations all drawn from a single Gaussian cluster with mean 0 and variance 1. We consider only the full-dataset training algorithm here with different possible proposal moves, to focus on the question of how well the different moves help us recover the true cluster. For hyperparameters, we set γ = 10 and the observation model's prior variance to 1. The resulting inference task is very simple: recover the single true cluster.
Ideally, every training algorithm would find the single true cluster from any initialization, regardless of how many clusters we start with. However, we will see that in practice this is not the case. Fig. 4.5 shows what happens when training a model from K = 5 initial clusters with merge moves only. Early on, the merge moves remove two clusters and identify a K = 3 model. However, this represents a tricky local optimum, from which no further merge move is accepted. The standard coordinate ascent iterations have not strictly converged but change very slowly. After hundreds of iterations from this fixed point, the ELBO score changes imperceptibly.

Although further merge moves are attempted, no merge of the remaining three clusters (illustrated as red, yellow, and blue in Fig. 4.5) improves the ELBO objective. Some merges (such as red and blue) result in small but noticeable ELBO drops, while a merge proposal combining blue and yellow sensibly leads to a much bigger drop, due to the much worse configuration it yields.

In contrast to merges, delete moves are more flexible. Delete proposals involve iterations that improve on all proposed responsibilities in the absorbing set, while merges by definition deliver a single fixed set of responsibilities. Even in this simple case, the iterative nature of delete proposals leads to much better candidate models, as the ELBO traces in Fig. 4.5 illustrate. Of course, we note that we need to run the delete proposal for sufficiently many iterations. It can be difficult to know on arbitrary datasets how long is required.
Figure 4.5: 1D Gaussian toy example where merges fail but deletes succeed. Example of a poor local optimum on a simple dataset of 1D samples from a single Gaussian cluster with mean 0 and variance 1. Using an initialization with 5 clusters, training with merge moves only leads to a fixed point with 3 clusters which persists for hundreds of iterations. A merge proposal for any pair of clusters leads to a lower ELBO score and thus rejection. However, the more flexible delete proposal is accepted. Top left: histogram of raw dataset. Top right: training ELBO vs. iterations for the merge-only training run, as well as example results of merges and deletes performed at iteration 200. Middle: Illustrations of the optimal cluster (used to generate the data) as well as the current local optimum. Bottom: Illustrations of proposed models trying to escape from the K = 3 local optimum. The delete move is accepted, while the merge is rejected.

Figure 4.6: Comparison of scalable DP mixture algorithms on synthetic image patch dataset. Results on toy dataset where generated 5x5 patches represent strong edges and corners, inspired by samples from a "dead leaves" model (Zoran and Weiss, 2012). Top: Trace of ELBO during training across 10 runs.
Stochastic (SO) algorithm compared with different learning rates a, b, c. Memoized algorithm with birth and merge moves (MO-BM) uses the original birth proposals from Hughes and Sudderth (2013), which are not guaranteed to improve the objective but nevertheless consistently find high-quality solutions. Bottom left: Example patch generated by each of the K = 8 true clusters. Bottom center: Illustration of the covariance matrices for the trained clusters discovered by one run of each inference method, aligned to the 8 true clusters. "X" indicates no comparable component found. Each covariance matrix has its global appearance probability π_k^G listed below. The true probabilities are 0.125 for each of the 8 clusters.

4.4.2 Toy image patch data

We now compare algorithms for learning DP-Gaussian mixture models (DP-GMM), using our own implementations of full-dataset, stochastic online (SO), and memoized online (MO) inference, as well as our new birth-merge memoized algorithm (MO-BM). To examine SO's sensitivity to learning rate, we use the decay schedule recommended by Hoffman et al. (2013), ξ_t = (t + d)^−κ, with three diverse settings: a) κ = 0.5, d = 10; b) κ = 0.5, d = 100; and c) κ = 0.9, d = 10. For some experiments, we also compare against Kurihara's public implementation of full-dataset split-move variational inference (Kurihara et al., 2006). Hyperparameters were chosen for each dataset in an empirical Bayes manner.

We first study N = 100,000 synthetic image patches generated by a zero-mean GMM with 8 equally-common components. Each component is defined by a 25 × 25 covariance matrix producing 5 × 5 patches with a strong edge. We investigate whether algorithms recover the true K = 8 structure. Each fixed-truncation method runs from 10 fixed random initializations with K = 25, while MO-BM starts at K = 1. Online methods traverse 100 batches (1000 examples per batch). Fig.
4.6 traces the training-set ELBO as more data arrives for each algorithm and shows estimated covariance matrices for the top 8 components for select runs. Even the best runs of SO do not recover ideal structure. In contrast, all 10 runs of our birth-merge algorithm find all 8 components, despite initialization at K = 1. The ELBO trace plots show this method escaping local optima, with slight drops indicating the addition of new components followed by rapid increases as these are adopted. They further suggest that our fixed-truncation memoized algorithm competes favorably with full-data inference, often converging to similar or better solutions after fewer passes through the data.

Figure 4.7: Comparison of scalable DP mixture algorithms on MNIST dataset. Top: Comparison of final ELBO for multiple runs of each method, varying initialization and number of batches. Stochastic online (SO) compared at learning rates a, b, c. Bottom left: Visualization of cluster means for MO-BM's best run. Bottom center: Evaluation of cluster alignment to true digit label. Bottom right: Growth in truncation level K as more data is visited with MO-BM.

The fact that our MO-BM algorithm only performs merges that improve the full-data ELBO is crucial. Fig. 4.6 shows trace plots of GreedyMerge, a memoized online variant that instead uses only the current-batch ELBO to assess a proposed merge, as done in Bryant and Sudderth (2012).
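The structural difference between the two acceptance rules can be sketched minimally: memoized acceptance sums cached per-batch sufficient statistics to score a proposal against the entire dataset, while GreedyMerge scores it against one batch. The scoring function below is a hypothetical stand-in (the thesis uses the full ELBO L); only the decision structure is the point.

```python
import numpy as np

def score(N_k):
    """Hypothetical stand-in objective computed from cluster counts N_k.
    (The thesis scores proposals with the full ELBO L.)"""
    p = N_k[N_k > 0] / N_k.sum()
    return float(np.sum(p * np.log(p)))

def memoized_merge_decision(batch_counts, kA, kB):
    """Exact accept/reject for merging clusters kA and kB.

    batch_counts : (B, K) array caching the counts N_k for each of B batches.
    Because sufficient statistics are additive, summing the cached rows
    recovers exact whole-dataset statistics -- no learning rate is needed,
    and the decision matches what full-dataset inference would make.
    """
    N_full = batch_counts.sum(axis=0)   # exact full-data statistics
    merged = N_full.copy()
    merged[kA] += merged[kB]            # merged cluster absorbs kB's mass
    merged[kB] = 0.0
    return score(merged) >= score(N_full)
```

A GreedyMerge-style variant would call `score` on a single row of `batch_counts`, which is exactly the failure mode described above: one small batch may not support the structure the full dataset demands.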
Given small batches (1000 examples each), there is not always enough data to warrant many distinct 25 × 25 covariance components. Thus, this method favors merges that in fact remove vital structure. All 5 runs of this GreedyMerge algorithm ruinously accept merges that decrease the full objective, consistently collapsing down to just one component. Our memoized approach ensures merges are always globally beneficial.

4.4.3 MNIST digit clustering

We now compare algorithms for clustering N = 60,000 MNIST images of handwritten digits 0-9. We preprocess as in Kurihara et al. (2006), projecting each image down to D = 50 dimensions via PCA. Here, we also compare to Kurihara's public implementation of variational inference with split moves (Kurihara et al., 2006). MO-BM and Kurihara start at K = 1, while other methods are given 10 runs from two K = 100 initialization routines: random and smart (based on k-means++, Arthur and Vassilvitskii (2007)). For online methods, we compare 20 and 100 batches, and three learning rates. All runs complete 200 passes through the full dataset.

Figure 4.8: Comparison of scalable DP mixture algorithms for clustering tiny images. Observed data: 32x32 color images from SUN dataset, projected via PCA so each observation x_n has dimension 50. Left: ELBO during training. Right: Visualization of 10 of 28 learned clusters for best MO-BM run. Each column shows two images from the top 3 categories aligned to one cluster.

The final ELBO values for every run of each method are shown in Fig. 4.7. SO's performance varies dramatically across initialization, learning rate, and number of batches. Under random initialization, SO reaches especially poor local optima (note lower y-axis scale).
In contrast, our memoized approach consistently delivers solutions on par with full inference, with no apparent sensitivity to the number of batches. With births and merges enabled, MO-BM expands from K = 1 to over 80 components, finding better solutions than every smart K = 100 initialization. MO-BM even outperforms Kurihara's offline split algorithm, yielding 30-40 more components and higher ELBO values. Altogether, Fig. 4.7 exposes SO's extreme sensitivity, validates MO as a more reliable alternative, and shows that our birth-merge algorithm is more effective at avoiding local optima.

Fig. 4.7 also shows cluster means learned by the best MO-BM run, covering many styles of each digit. We further compute a hard segmentation of the data using the q(z) from smart initialization runs. Each DP-GMM cluster is aligned to one digit by majority vote of its members. A plot of alignment accuracy in Fig. 4.7 shows our MO-BM consistently among the best, with SO lagging significantly.

4.4.4 Clustering tiny images

We next learn a full-mean DP-GMM for tiny, 32 × 32 images from the SUN-397 scene categories dataset (Xiao et al., 2010). We preprocess all 108,754 color images via PCA, projecting each example down to D = 50 dimensions. We start MO-BM at K = 1, while other methods have fixed K = 100. Fig. 4.8 plots the training ELBO as more data is seen. Our MO-BM runs surpass all other algorithms. To verify quality, Fig. 4.8 shows images from the 3 most-related scene categories for each of several clusters found by MO-BM. For each learned cluster k, we rank all 397 categories to find those with the largest fraction of members assigned to k via r̂_·k. The result is quite sensible, with clusters for tall free-standing objects, swimming pools and lakes, doorways, and waterfalls.

4.5 Discussion

Across many applications of the DP mixture model, we see clear gains from our adaptive proposal moves compared to fixed-truncation baselines.
The merge moves are particularly effective at selecting promising pairs of clusters and successfully merging them without much additional cost over the standard fixed-truncation algorithm. When possible, using moves that provably increase the objective L makes the process both easier to debug and more reliable when applied to new applications.

Chapter 5

Scalable variational inference for HDP Topic Models

Ch. 1 motivated topic models (Blei, 2012) as a flexible extension of mixture models which allows each group or document d to have its own specific mixture weights π_d. Teh et al. (2006) introduced the hierarchical Dirichlet process (HDP) as a Bayesian nonparametric prior for these document-specific random variables π_d, which leads to improved statistical strength in estimating these values and generalizing to novel documents. We now develop algorithms for scalable training of approximate posterior representations of the HDP topic model. This chapter describes and extends earlier work from an AISTATS 2015 conference paper with collaborators Dae Il Kim and Erik Sudderth (Hughes et al., 2015a). The fundamental contributions are identified below:

Contribution 1: New optimization problem for topic models with model selection capability. First, we set up a novel optimization problem for inference which learns approximate posteriors, not point estimates, for all global random variables u, φ as well as all local random variables {π_d, z_d}. We show that this approach leads to beneficial model selection properties, while the point estimation strategy for π^G (equivalently, for u) used in previous approaches (Bryant and Sudderth, 2012; Liang et al., 2007) has problematic model selection behavior. Achieving this requires a careful surrogate bound to deal with non-conjugacy in the HDP.

Contribution 2: Improved local step single-document inference, with applications to finite and HDP topic models.
When visiting a single document, the problem of updating the posteriors for local random variables q(z_d) and q(π_d) in any mean-field approach is non-convex with abundant local optima. We first bring attention to this problem by characterizing the practical issues on real data. We then develop novel restart proposals to mitigate this issue.

Contribution 3: Scalable memoized training algorithm for HDP topic model. Until recently, only stochastic variational algorithms were known for training the finite LDA or infinite HDP topic models. We have developed a memoized algorithm with the same per-batch runtime cost and modest storage. Due to non-convexity in the local update step, the algorithm for any topic model lacks monotonicity guarantees, but we find in practice that it performs well.

Contribution 4: Birth, merge, and delete proposals to escape local optima. We finally develop proposal moves to escape local optima by adding or removing active clusters. We show that the topic model with multinomial likelihoods is vulnerable to especially cruel local optima, but our moves can often escape from these.

Roadmap. After fully specifying the model, we identify our chosen family of approximate posterior distributions and corresponding free parameters, define the optimization problem's objective function under this chosen family, and derive the basic coordinate ascent updates. Next, we describe scalable versions of our algorithms and discuss the details of proposal moves for the HDP topic model optimization problem.

5.1 Hierarchical Dirichlet process (HDP) topic models

Topic models, also called admixture models, explain datasets which are partitioned into D exchangeable groups x = {x_1, . . . , x_D}. We will often refer to each group as a document, because text modeling is a primary application. However, the notion of grouping may apply equally well to observations from different images, or different hospitals, or any similar application.
Each group or document d contains T_d observations, denoted x_d = {x_d1, . . . , x_dT_d}. In text analysis, each x_dt might represent an individual word from a predefined vocabulary. In image analysis, each x_dt might represent the real pixel values of the t-th patch from the d-th image. Remember that we can also index observations without the document structure using the index n; the total number of observations is N = Σ_{d=1}^D T_d.

Our goal in analyzing the dataset x is to identify a set of clusters or topics common across all groups, while allowing each group to have variability in topic usage. Teh et al. (2006) introduced the hierarchical Dirichlet process (HDP) topic model illustrated in Fig. 5.1 as a natural model for this goal. The HDP is a hierarchical extension of the Dirichlet process mixture model. It explains the data with an infinite set of possible clusters, {φ_k}_{k=1}^∞, but allows each group or document to have a specific set of appearance probabilities π_d.

Generative model

Like the DP mixture model, in the HDP topic model each topic k is defined by two global variables: the data-generating cluster shape parameters φ_k, and a conditional cluster frequency u_k. As in Sec. 2.1.3, we will assume each cluster shape parameter is independently drawn from some prior distribution φ_k ∼ P.

Figure 5.1: Directed graphical representation of hierarchical Dirichlet process topic model. The diagram shows the fundamental random variables (circled nodes), hyperparameters (gray), and variational free parameters (red) of the hierarchical Dirichlet process (HDP) topic model. This Bayesian nonparametric model generates a countably infinite number of clusters under the prior, of which some number K ≤ N are assigned to data. Each cluster is defined by two global parameters: conditional probability u_k and shape parameter φ_k. The conditional probabilities u are deterministically mapped to global cluster probabilities π^G via the invertible stick-breaking transformation. The dataset x consists of D groups or documents, and is assumed to be within-group exchangeable. In addition to the local cluster assignments z_dt for each observation in the document, we also assume local document-specific frequency vectors π_d.

Our focus will be on the generative model for cluster frequencies, because the HDP introduces a more flexible model for this case. Let each scalar 0 < u_k < 1 define the conditional probability of sampling topic k given that the first k − 1 topics were not sampled. That is, the probability of selecting label k among the infinite set {k, k + 1, k + 2, . . .}. Each value u_k is sampled independently from a beta distribution: u_k ∼ Beta(1, γ). Using the stick-breaking transformation (Sethuraman, 1994; Blei and Jordan, 2006), we can deterministically produce a vector π^G of global cluster probabilities given the vector u:

    π_k^G(u) ≜ u_k ∏_{ℓ=1}^{k−1} (1 − u_ℓ).    (5.1)

We interpret the scalar π_k^G as the global probability of topic k appearing in any document. Unlike the vector u, the vector π^G will sum to one by construction.

Next, each group or document has a uniquely-specified topic frequency vector π_d. This vector has infinitely many entries and will sum to one. As a more tractable representation, we can write this as a finite vector of size K + 1, where the first K entries represent the first K cluster indices in stick-breaking order and the final entry represents the aggregate mass of all entries beyond index K:

    π_d = [π_d1 π_d2 . . . π_dK π_d>K],    π_d>K ≜ Σ_{ℓ=K+1}^∞ π_dℓ.    (5.2)

Given this finite partition of the infinite set of clusters, we define the generative model for the document-specific vector π_d given the global vector π^G:

    [π_d1 . . . π_dK π_d>K] ∼ Dir(απ_1^G, . . . , απ_K^G, απ_>K^G).
(5.3)

This generative distribution implies that the mean of the document-specific probability π_dk is the global probability π_k^G. Each document can fluctuate around this mean with variance determined by the concentration parameter α > 0.

We complete the generative model by describing the token creation process. First, we sample the t-th observed token's assigned cluster label z_dt ∈ {1, 2, . . . , K, . . .} from a categorical distribution with parameter π_d:

    z_dt ∼ Cat_∞(π_d1, π_d2, . . . , π_dK, . . .).    (5.4)

Finally, as in the mixture model, we generate each token's data x_dt from the chosen likelihood model:

    x_dt ∼ L(x_dt | φ_{z_dt}),    (5.5)

where again we assume the likelihood density L belongs to the exponential family, as described in Sec. 2.1.3.

Interpretation: Document-specific mixture with hierarchically-related frequencies

When studying the model described above and illustrated in Fig. 5.1, we emphasize a specific interpretation: each document's exchangeable observations are generated from a document-specific mixture model. This mixture model is generated by a hierarchical chain of Dirichlet process realizations. First, we draw cluster frequencies and shapes {π_k^G, φ_k}_{k=1}^∞ from the root or global Dirichlet process. Second, we draw for each document a set of frequency-shape pairs {π_dk, φ_k}_{k=1}^∞, where the shape parameters are common across all documents but the frequencies are document-specific random variables related by a common mean vector π^G.

Global and local variables

By inspection of Fig. 5.1, we recognize two global parameters for the HDP topic model: the cluster frequency parameters u_k and cluster shape parameters φ_k. We then have two sets of local variables at each document: the document-specific probability vector π_d and the discrete assignments z_d = {z_d1, z_d2, . . . , z_dT_d}.
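The generative process of Eqs. (5.1)-(5.5) can be sketched for a truncated model as follows. This is an illustrative sketch only: the 1D Gaussian likelihood and the cluster means `mu` are assumptions for concreteness, since the model allows any exponential-family likelihood L.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(u):
    """Eq. (5.1): pi_k^G = u_k * prod_{l<k} (1 - u_l)."""
    u = np.asarray(u, dtype=float)
    stick_left = np.concatenate([[1.0], np.cumprod(1.0 - u[:-1])])
    return u * stick_left

gamma, alpha, K, T_d = 10.0, 0.5, 5, 100
u = rng.beta(1.0, gamma, size=K)          # u_k ~ Beta(1, gamma)
pi_G = stick_breaking(u)
pi_G = pi_G / pi_G.sum()                  # renormalize after K-truncation
pi_d = rng.dirichlet(alpha * pi_G)        # Eq. (5.3): document frequencies
z_d = rng.choice(K, size=T_d, p=pi_d)     # Eq. (5.4): token assignments
mu = np.linspace(-2.0, 2.0, K)            # hypothetical 1D cluster means
x_d = rng.normal(mu[z_d], 1.0)            # Eq. (5.5): observed tokens
```

Note the renormalization step is only an artifact of truncating the infinite stick-breaking construction at K atoms; in the untruncated model π^G sums to one by construction.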
5.2 Posterior inference as a variational optimization problem

Our goal in performing posterior inference for an HDP topic model is to estimate the joint posterior p(φ, u, {π_d, z_d}_{d=1}^D | x) given observed documents x = {x_1, . . . , x_D}. As in DP mixture models, estimating the joint posterior directly is intractable, so we set up a variational inference optimization problem instead. The approach described fully below was first given in our earlier conference paper (Hughes et al., 2015a).

5.2.1 Mean-field approximate posterior

Let q(φ, u, {π_d, z_d}_{d=1}^D) denote our approximate posterior. We will make the standard mean-field simplifying assumptions about factorizing q. That is, each document d has independent factors π_d and z_d, while each global cluster k has independent factors for u_k and φ_k. Thus, we can factorize the approximate density as

    q(φ, u, {π_d, z_d}_{d=1}^D) = ∏_{k=1}^∞ q(φ_k) · ∏_{k=1}^∞ q(u_k) · ∏_{d=1}^D q(π_d) · ∏_{d=1}^D ∏_{t=1}^{T_d} q(z_dt).    (5.6)

We further assume each factor of the approximate posterior has a density that belongs to the exponential family, whose form naturally mimics the generative model when appropriate. The full parameterization is given here:

    q(φ) = ∏_{k=1}^∞ P(φ_k | τ̂_k, ν̂_k)    (5.7)
    q(u) = ∏_{k=1}^∞ Beta(u_k | û_k ω̂_k, (1 − û_k) ω̂_k)    (5.8)
    q(π) = ∏_{d=1}^D Dir_{K+1}(π_d | θ̂_d1 . . . θ̂_dK θ̂_d>K)    (5.9)
    q(z) = ∏_{d=1}^D ∏_{t=1}^{T_d} Cat_∞(z_dt | r̂_dt1, . . . , r̂_dtk, . . .)    (5.10)

The goal of variational optimization is to find specific values of these free parameters that make q(u, φ, {π_d, z_d}_{d=1}^D) a good approximation to the true posterior. For each factor, we denote the free parameters with hats to make clear which variables are instantiated and optimized during inference. Below, we discuss the free parameters in each factor in detail.

Global allocation model free parameters

In Eq. 5.8, we define q(u_k) as a Beta distribution with two parameters: û_k ∈ [0, 1] defines the mean value of u_k, while ω̂_k > 0 defines the variance of u_k.
Under this distribution, we have the expectations:

    E_q[u_k] = û_k            E_q[log u_k] = ψ(û_k ω̂_k) − ψ(ω̂_k)
    E_q[1 − u_k] = 1 − û_k    E_q[log(1 − u_k)] = ψ((1 − û_k) ω̂_k) − ψ(ω̂_k)    (5.11)
    E_q[π_k^G(u)] = û_k ∏_{ℓ=1}^{k−1} (1 − û_ℓ)

Global observation model free parameters

For each cluster k, we have an independent posterior q(φ_k) over the cluster shape parameter φ_k. The definition of q(φ_k) in Eq. 5.7 follows the standard procedure from Sec. 2.1.3 using free parameters ν̂_k and τ̂_k. There is nothing new here compared to the version of q(φ_k) used for the DP mixture model in Ch. 3.

Local parameters for document-specific cluster assignments

As with the DP mixture model, we have a non-negative responsibility vector r̂_dt for each observation t in document d. We can interpret the scalar value r̂_dtk ∈ [0, 1] as the probability of assigning this observation to cluster k. To be a valid parameter of a categorical distribution, the vector r̂_dt must be non-negative and sum to one. To gain tractability, we follow the methods in Sec. 3.3.1 and truncate r̂_dt so that only the first K entries may have positive probability mass. All inactive clusters k > K are constrained by assumption so that r̂_dtk = 0.

Local parameters for document-specific cluster frequencies

The factor q(π_d) has the free parameter θ̂_d ≥ 0, a vector of size K + 1 whose entries are non-negative. The first K entries of θ̂_d can be interpreted as pseudocounts of the total assigned mass for the first K active clusters within document d. The remaining final entry defines the aggregate mass of all inactive clusters.
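The expectations of Eq. (5.11) are simple digamma expressions of the free parameters; a minimal sketch (the function name is illustrative):

```python
import numpy as np
from scipy.special import digamma

def beta_posterior_expectations(u_hat, omega_hat):
    """Closed-form expectations of Eq. (5.11) under
    q(u_k) = Beta(u_hat_k * omega_hat_k, (1 - u_hat_k) * omega_hat_k),
    so the two Beta parameters sum to omega_hat_k."""
    u_hat = np.asarray(u_hat, dtype=float)
    omega_hat = np.asarray(omega_hat, dtype=float)
    E_log_u = digamma(u_hat * omega_hat) - digamma(omega_hat)
    E_log_1mu = digamma((1.0 - u_hat) * omega_hat) - digamma(omega_hat)
    # E_q[pi_k^G(u)] = u_hat_k * prod_{l<k} (1 - u_hat_l), using the
    # independence of the u_k factors under q
    E_pi_G = u_hat * np.concatenate([[1.0], np.cumprod(1.0 - u_hat[:-1])])
    return E_log_u, E_log_1mu, E_pi_G
```

As a sanity check, û_k = 0.5 and ω̂_k = 2 give q(u_k) = Beta(1, 1), the uniform distribution, for which E[log u_k] = −1 exactly.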
Here are useful expectations for active clusters k ≤ K:

    E_q[π_dk] = θ̂_dk / (θ̂_d>K + Σ_{k=1}^K θ̂_dk)        E_q[log π_dk] = ψ(θ̂_dk) − ψ(θ̂_d>K + Σ_{k=1}^K θ̂_dk)    (5.12)

And for the remaining aggregate mass, we have the expectations:

    E_q[π_d>K] = θ̂_d>K / (θ̂_d>K + Σ_{k=1}^K θ̂_dk)      E_q[log π_d>K] = ψ(θ̂_d>K) − ψ(θ̂_d>K + Σ_{k=1}^K θ̂_dk)    (5.13)

5.2.2 Evidence lower-bound objective function

Given the assumed factorization of q(u, φ, {π_d, z_d}_{d=1}^D) above, we now set up an optimization problem over our free parameters, just as we did for the DP mixture model in Sec. 3.3.2. The goal is to minimize the KL divergence between the approximate posterior q and the true posterior (Wainwright and Jordan, 2008):

    arg min_{τ̂, ν̂, û, ω̂, θ̂, r̂} KL( q(u, φ, {π_d, z_d}_{d=1}^D | τ̂, ν̂, û, ω̂, θ̂, r̂) || p(u, φ, {π_d, z_d}_{d=1}^D | x) )    (5.14)

Computing this KL divergence directly is not possible. However, standard arguments yield an equivalent maximization problem

    arg max_{τ̂, ν̂, û, ω̂, θ̂, r̂} L(x, τ̂, ν̂, û, ω̂, θ̂, r̂)    (5.15)

where the objective function L is defined as:

    L(x, τ̂, ν̂, û, ω̂, θ̂, r̂) ≜ log p(x) − KL( q(u, φ, π, z | τ̂, ν̂, û, ω̂, θ̂, r̂) || p(u, φ, π, z | x) )    (5.16)
                              = E_{q(u,φ,π,z)}[ log p(x, u, φ, π, z) − log q(u, φ, π, z) ]    (5.17)

Under our chosen exponential family forms for each factor of the approximate posterior q, the expectations that define L are computable as closed-form functions of the free parameters. We can tractably evaluate L and take derivatives with respect to the free parameters, making optimization possible.

As in Eq. 2.93 for DP mixtures, we can write the objective L for HDP topic models as a sum of two terms:

    L(x, r̂, τ̂, ν̂, θ̂, û, ω̂) = L_data(x, r̂, τ̂, ν̂) + L_alloc(r̂, θ̂, û, ω̂)    (5.18)

These terms describe distinctly interpretable pieces of the overall model: L_data gathers terms related to the observation model and L_alloc gathers terms related to the HDP topic allocation model.
These terms may also be functions of the hyperparameters γ, α, τ̄, ν̄, but we omit these arguments from the notation for simplicity. This term-by-term breakdown of the objective encourages a modular implementation. Given fixed assignments r̂, the solution for the optimal observation model parameters ν̂, τ̂ must be independent of the allocation probability parameters, and vice versa. This modularization allows our implementation to write free parameter updates once for each possible observation model or allocation model, and compose these modules to create an overall model.

Observation model term of the objective

The term L_data for HDP topic models is no different from the same term for DP mixture models. We thus refer to the original expression in Eq. 2.95, which can be evaluated given valid free parameters r̂, τ̂, ν̂ as well as data x. This expression does not rely on any document-specific quantities, so we use the single-level data index n to identify responsibilities r̂_n and data x_n, rather than the corresponding two-level document-token index d, t.
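A minimal sketch of this index flattening, and of the kind of per-cluster statistics the observation-model term consumes. The 1D Gaussian-mean statistics N_k and S_k here are an illustrative assumption; the actual statistics depend on the chosen exponential-family likelihood.

```python
import numpy as np

def flatten_docs(resp_per_doc):
    """Stack per-document responsibilities {r_d : (T_d, K)} into one
    (N, K) array; the flat index n replaces the pair (d, t), which is
    all the observation-model term L_data ever needs."""
    return np.vstack(resp_per_doc)

def observation_suff_stats(x, resp):
    """Per-cluster statistics for an illustrative 1D Gaussian-mean model:
    N_k = sum_n r_nk   and   S_k = sum_n r_nk * x_n."""
    N_k = resp.sum(axis=0)
    S_k = resp.T @ x
    return N_k, S_k
```

Because these statistics are sums over n, they can be accumulated batch by batch, which is exactly the property the memoized algorithms exploit.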
Allocation model term of the objective

For the HDP topic model, we write the allocation model's contribution to the objective as:

    L_alloc(r̂, θ̂, û, ω̂) ≜ Σ_{d=1}^D E_{q(z_d) q(π_d) q(u)}[ log ( p(z_d | π_d) / q(z_d | r̂_d) ) + log ( p(π_d | α, π^G(u)) / q(π_d | θ̂_d) ) ]
                           + Σ_{k=1}^∞ E_{q(u)}[ log ( p(u_k | γ) / q(u_k | û, ω̂) ) ]    (5.19)

Regrouping terms, we can separate this into an entropy term for the distribution q(z | r̂), a term which gathers any document-specific quantities, and a term for the remaining global quantities:

    L_alloc(r̂, θ̂, û, ω̂) = L_entropy(r̂) + L_HDP-doc(r̂, θ̂, û) + L_HDP-top(û, ω̂)    (5.20)

Figure 5.2: Plots of surrogate bound needed to handle non-conjugacy of HDP topic model. Left: Illustration of the bound −log Γ(x) ≥ log(x) for x ∈ [0, 1], which inspired our surrogate bound. Right: Illustration of the exact log Dirichlet cumulant function c_Dir(αb_1, αb_2, . . . , αb_{K+1}) (Eq. (5.22), solid black) as a function of the number of active clusters K, for various α > 0. At each value of K, we set the vector b = [b_1 b_2 . . . b_K b_>K] so its active entries are equal to b_k = (1/(1+γ)) ∏_{ℓ<k} (γ/(1+γ)), and the final entry b_>K is set so the vector sums to one. We show alongside each exact evaluation our proposed surrogate bound (Eq. (5.23), dashed red), which this plot shows to be very tight across a range of practical α values. Topic models are typically trained with 0 < α < 1 to encourage sparsity.

The specific mathematical forms of each of these terms are given in the sections below.

5.2.3 Global term of the allocation objective, using a surrogate bound

To define the global allocation term, we have gathered all terms that depend only on the global free parameters û and ω̂.
    L_HDP-top(û, ω̂) ≜ Σ_{k=1}^∞ E_{q(u)}[ log ( p(u_k) / q(u_k) ) ] + D E_{q(u)}[ c_Dir(α π^G(u)) ]    (5.21)

These expectations have no dependence on any local free parameters {θ̂_d, r̂_d}_{d=1}^D. We have included the Dirichlet cumulant function that appears in the expectation E_q[log Dir(π_d | απ^G)], because this expectation is purely a function of û, ω̂. Under the chosen form of q(u), we have closed-form expressions for the expectations of log p(u_k) and log q(u_k) in Eq. 5.21. However, the term E_q[c_Dir(απ^G(u))] is problematic: there is no closed-form expression for this expectation.

To see this, recall that c_Dir is the cumulant function of the Dirichlet distribution. It takes two parameters: a positive scalar α > 0 and a nonnegative vector b of length K + 1 that sums to one. The function is then defined as:

    c_Dir(αb_1, . . . , αb_K, αb_{K+1}) ≜ log Γ(α) − Σ_{k=1}^{K+1} log Γ(αb_k)    (5.22)

The difficulty is that the expected value of log Γ(απ_k^G(u)) when u_k is a beta random variable has no known closed form.

Figure 5.3: HDP model selection: point estimation vs. variational with surrogate bound. We consider several possible algorithms for estimating the top-level conditional probabilities u from a fixed set of observed assignments z. To assess model selection capability, we consider a fixed set of assignments z with K = 10 active clusters, as well as the same assignments with K = 11, 12, 13 active clusters. That is, the latter models simply insert extra clusters with no assigned observations. In this plot, for each method we show the change in its optimization objective function (the evidence lower bound, or ELBO) from the value with no extra empty topics. We seek a method whose optimization objective function prefers no empty topics to all other cases. Algorithms: First, we consider point estimation of û directly, which solves the optimization problem arg max_{u, θ̂} log p(z, u). Second, an approximate posterior estimate, which seeks the parameters û, ω̂ defining the distribution q(u) = ∏_{k=1}^K q(u_k), q(u_k) = Beta(mean = û_k, scale = ω̂_k), which best approximates the true (intractable) posterior p(u | z). The optimization problem here is arg max_{û, ω̂, θ̂} L(û, ω̂, θ̂, z). Conclusions: Our new surrogate bound sensibly prefers models without empty topics, while point estimation yields an undesired preference for including empty topics that do not explain any data.

To avoid this problematic expectation of log Gamma functions, we introduce a novel lower bound for the cumulant function c_Dir. Given a positive scalar α > 0 and a nonnegative vector [b_1, b_2, . . . , b_K, b_{K+1}] of size K + 1 which sums to one, we have:

    c_Dir(αb_1, αb_2, . . . , αb_K, αb_{K+1}) ≥ K log α + Σ_{k=1}^{K+1} log b_k    (5.23)

To deal with our intractable expectation E_q[c_Dir(απ^G(u))], we can substitute π^G(u) for the vector b in the above bound and apply the expansions π_k^G = u_k ∏_{ℓ<k} (1 − u_ℓ) and π_>K^G = ∏_{ℓ≤K} (1 − u_ℓ). After simplifying, we have

    c_Dir(απ^G(u)) ≥ K log α + Σ_{k=1}^K log u_k + Σ_{k=1}^K (K + 1 − k) log(1 − u_k)    (5.24)

Under our chosen beta distribution for q(u_k), both expectations required by this lower bound, E_q[log u_k] and E_q[log(1 − u_k)], have closed form. Thus, we can define a surrogate objective term L_surrogate-HDP-top which is a tight lower bound on L_HDP-top. That is, L_HDP-top(û, ω̂) ≥ L_surrogate-HDP-top(û, ω̂) for all values of the parameters û, ω̂ and all hyperparameters α > 0, γ > 0. This surrogate objective term can be derived by combining the bound in Eq. 5.24 with the expanded terms in Eq. (5.21):

    L_surrogate-HDP-top(û, ω̂) = DK log α + Σ_{k=1}^K [ c_Beta(1, γ) − c_Beta(û_k ω̂_k, (1 − û_k) ω̂_k) ]    (5.25)
(5.21): K X Lsurrogate-HDP-top (ˆ u, ω ˆ ) = DK log α + cBeta (1, γ) − cBeta (ˆ ˆ k , (1 − u uk ω ˆk )ˆ ωk ) (5.25) k=1 K X + (D + 1 − u ˆk ω ˆ k )Eq [log uk ] k=1 K X + (D(K + 1 − k) + γ − (1 − u ωk )Eq [log 1 − uk ] ˆk )ˆ k=1 5.2.4 Document-specific term of the allocation objective The term LHDP-doc of our overall objective Lalloc gathers any remaining terms not included in Lentropy or LHDP-top . Standard conjugate exponential family mathematics yields an expanded expression: D h ˆ uˆ) = X p(πd ) i LHDP-doc (ˆ r , θ, Eq log p(zd |πd ) + log − DEq [cDir (απ G (u))] (5.26) q(πd ) d=1 D X = −cDir (θˆd ) (5.27) d=1 D X X K   + Ndk (ˆ u) − θˆdk Eq(πd |θˆd ) [log πdk ] r ) + απkG (ˆ d=1 k=1 D  X  + G απ> u) − θˆd>K Eq(πd |θˆd ) [log πd>K ] K (ˆ d=1 PTd Here, we define Ndk (ˆ r) = t=1 r ˆdtk , which is interpreted as the effective count of assignments to cluster k in document d. We have also substituted πkG (ˆ u) , Eq(u) [πkG (u)] in the sum over active G G clusters, and π>K (ˆ u) = Eq(u) [π>K (u)] in the last line, where we remember that scalar uk is an 113 unknown random variable while u ˆk is a free parameter which defines its approximate posterior mean. All expectations taken with respect to q(πd ) are defined in Eq. (5.12) and Eq. (5.13) as digamma ˆ Finally, the function cDir is the Dirichlet cumulant defined earlier in Eq. (5.22). functions of θ. 5.2.5 Assignment entropy term of the allocation objective The entropy of the assignments is a simple non-linear function of the responsibilities: X Td D X N X Lentropy (ˆ r) = − Eq [log q(zdt )] = − rˆdtk log rˆdtk . (5.28) d=1 t=1 n=1 We emphasize that no document-specific indexing is required here. Thus, we may replace the two- level document-token indices d, t with the global data index n, which always have a one-to-one invertible relationship. Written using the global indices n for each data observation, this entropy expression is no different than the corresponding entropy term for DP mixture models. 
5.3 Update steps for variational optimization

Now, given the objective functions defined above, we can write down the constrained optimization problem for our free parameters τ̂, ν̂, û, ω̂, θ̂, r̂ given an observed dataset x and hyperparameters H:

    arg max_{τ̂, ν̂, û, ω̂, θ̂, r̂}  L^data(x, r̂, τ̂, ν̂) + L^entropy(r̂) + L^HDP-doc(r̂, θ̂, û) + L^surrogate-HDP-top(û, ω̂)        (5.29)

where the required constraints on the local free parameters for document d are:

    r̂_dt ≥ 0 and Σ_{k=1}^K r̂_dtk = 1   for t = 1, 2, ..., T_d        (5.30)
    θ̂_dk ≥ 0   for k = 1, 2, ..., K, K + 1

and the required constraints on the global free parameters are:

    û_k ∈ [0, 1] and ω̂_k ≥ 0   for k = 1, 2, ...        (5.31)
    ν̂_k ≥ 0 and τ̂_k ∈ M   for k = 1, 2, ...

Given the internal structure of the objective, we pursue a block-coordinate ascent algorithm which proceeds in two steps, a local step and a global step. The local step updates the free parameters r̂_d, θ̂_d for each document d while holding global parameters fixed. The global step updates the global free parameters given fixed (summary statistics of) local parameters. Below, we define the optimization problem solved by each step of the block-coordinate ascent algorithm. We then describe detailed closed-form solutions to each problem in the following sections.

Key differences from coordinate ascent for DP mixtures

We stress two key differences from the DP mixture model optimization procedure of Ch. 3. These are useful to have in mind from the outset, to properly understand why the HDP training problem may be more complex than the DP training problem, beyond the fact that there are simply more latent variables. The first key difference is that the DP training algorithm had an objective function which was exactly equal to log p(x) − KL(q‖p). In contrast, the HDP training objective L^HDP is a lower bound of this quantity.
We need this surrogate bound for tractability, but its consequence is that more complex models are penalized more sharply than in the DP case. The second key difference lies in the guarantees. The DP training algorithm guaranteed monotonic increase of the objective function L; we can only make this guarantee for the HDP topic model under one costly assumption: that the vector θ̂_d is always stored for each document.

5.3.1 Global parameter update step for observation model

Consider the optimization problem:

    arg max_{τ̂, ν̂} L^data(x, r̂, τ̂, ν̂)   subject to ν̂_k ≥ 0 and τ̂_k ∈ M for k = 1, 2, ...        (5.32)

The optimization problem in Eq. (5.32) is the same as for the observation model in the DP mixture model, found in Eq. (3.39). Thus, we can apply the solution from Eq. (3.41) without complication:

    τ̂_k* = S_k(x, r̂) + τ̄        (5.33)
    ν̂_k* = N_k(r̂) + ν̄

These optimal values naturally satisfy the required constraints ν̂_k* ∈ R_+ and τ̂_k* ∈ M, so that ν̂_k* and τ̂_k* remain valid parameters for the density q(φ_k).

5.3.2 Global parameter update step for allocation model

The allocation model global step solves the following optimization problem:

    arg max_{û, ω̂}  L^HDP-doc(r̂, θ̂, û) + L^surrogate-HDP-top(û, ω̂)        (5.34)
    subject to  û_k ∈ [0, 1]   for k = 1, 2, ...        (5.35)
                ω̂_k ≥ 0   for k = 1, 2, ...        (5.36)

Due to non-conjugacy in the generative model's specification that π_d ∼ Dir(απ^G(u)), the global step update for q(u) does not have closed form. However, we can evaluate the objective function of the problem in Eq. (5.34) given any valid free parameters û, ω̂. This suggests a numerical optimization approach based on L-BFGS or similar modern gradient-based algorithms.

Gradient descent update for mean parameter û

Writing the complete objective from the optimization problem in Eq. (5.34) as a function of û, we have:

    f(û) = − Σ_{k=1}^K c_Beta(û_k ω̂_k, (1 − û_k) ω̂_k)        (5.37)
        + Σ_{k=1}^K (D + 1 − û_k ω̂_k) [ψ(û_k ω̂_k) − ψ(ω̂_k)]
        + Σ_{k=1}^K (D(K + 1 − k) + γ − (1 − û_k) ω̂_k) [ψ((1 − û_k) ω̂_k) − ψ(ω̂_k)]
        + απ_{>K}^G(û) P_{>K}(θ̂) + α Σ_{k=1}^K π_k^G(û) P_k(θ̂)

where we define the summary statistics:

    P_k(θ̂) ≜ Σ_{d=1}^D E_q[log π_dk] = Σ_{d=1}^D [ ψ(θ̂_dk) − ψ(Σ_{ℓ=1}^{K+1} θ̂_dℓ) ]        (5.38)
    P_{>K}(θ̂) ≜ Σ_{d=1}^D E_q[log π_{d>K}] = Σ_{d=1}^D [ ψ(θ̂_{d>K}) − ψ(Σ_{ℓ=1}^{K+1} θ̂_dℓ) ]

The function f(û) and its gradient (which is easily computed either by hand or with autodifferentiation code packages) are all that is needed to perform optimization to find the best value û*. As before, we need only focus on the first K indices of the vector û*, since any inactive cluster k > K will be optimally set to the prior mean: û_k* = 1/(1 + γ) for k > K.

Heuristic update for variance parameter ω̂

Close inspection reveals a possible simplification: the free parameter vector ω̂, which determines the variance of q(u_k) at each active cluster, only impacts the term L^surrogate-HDP-top defined in Eq. (5.25). After inspecting the optimal global step update for the similar q(u) of the DP mixture model in Eq. (3.43), the structure of Eq. (5.25) suggests that a useful "rule-of-thumb" update for ω̂ is:

    ω̂_k* = D(K + 1 − k) + D + 1 + γ   for k ≤ K        (5.39)
    ω̂_k* = 1 + γ                       for k > K

Recall that the larger ω̂_k is, the less variance exists in the distribution q(u_k). Intuitively, because index k of the vector π^G(u) is determined by the first k indices of the vector u, the estimated certainty of q(u_k) should sensibly be highest for index 1 and slowly decay as the index gets larger. Using the heuristic closed-form update for ω̂_k* above, we can save considerable cost by avoiding iterative estimation of ω̂_k* at each iteration of any algorithm, while still reaping the benefits of representing a proper approximate posterior distribution for q(u) rather than a point estimate.
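The numerical global step can be made concrete with a small sketch. The code below implements f(û) from Eq. (5.37) and maximizes it with bound-constrained L-BFGS; the function names and the toy values of D, α, γ, ω̂, and the P statistics are our own placeholders, not values from the thesis experiments, and the gradient is left to finite differences rather than computed by hand:

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.optimize import minimize

def c_beta(a, b):
    # Beta cumulant: log Gamma(a+b) - log Gamma(a) - log Gamma(b)
    return gammaln(a + b) - gammaln(a) - gammaln(b)

def objective_f(u, omega, alpha, gamma, D, P_active, P_gtK):
    """Eq. (5.37), to be maximized over u in (0,1)^K.

    P_active[k] and P_gtK stand in for the summary statistics of Eq. (5.38)."""
    K = len(u)
    ks = np.arange(1, K + 1)
    a, b = u * omega, (1.0 - u) * omega
    # Stick-breaking expectations: pi_k = u_k * prod_{l<k} (1 - u_l)
    cumprod = np.cumprod(1.0 - u)
    pi_active = u * np.concatenate(([1.0], cumprod[:-1]))
    pi_gtK = cumprod[-1]
    val = -np.sum(c_beta(a, b))
    val += np.sum((D + 1 - a) * (digamma(a) - digamma(omega)))
    val += np.sum((D * (K + 1 - ks) + gamma - b) * (digamma(b) - digamma(omega)))
    val += alpha * pi_gtK * P_gtK + alpha * np.sum(pi_active * P_active)
    return val

# Toy problem: K=3 active topics, D=10 documents (P statistics made up here).
K, D, alpha, gamma = 3, 10, 0.5, 1.0
omega = np.full(K, 20.0)
P_active, P_gtK = np.array([-1.0, -2.0, -3.0]), -8.0
res = minimize(
    lambda u: -objective_f(u, omega, alpha, gamma, D, P_active, P_gtK),
    x0=np.full(K, 0.5),
    method="L-BFGS-B",
    bounds=[(1e-8, 1 - 1e-8)] * K,   # enforce u_k in (0, 1)
)
u_star = res.x
```

Because ψ(û_k ω̂_k) → −∞ as û_k → 0 and ψ((1 − û_k) ω̂_k) → −∞ as û_k → 1, the objective falls off steeply at the boundary and the bound-constrained optimizer settles at an interior point.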
5.3.3 Local update step

Given fixed values of the global observation parameters τ̂, ν̂ and allocation parameters û, ω̂, the required optimization problem at each document d is to find responsibilities r̂_d which define q(z_d) and pseudo-counts θ̂_d which define q(π_d):

    arg max_{r̂_d, θ̂_d}  L^data(x_d, r̂, τ̂, ν̂) + L^entropy(r̂) + L^HDP-doc(r̂, θ̂, û)        (5.40)
    subject to  r̂_dt ≥ 0 and Σ_{k=1}^K r̂_dtk = 1   for t = 1, 2, ..., T_d
                θ̂_dk ≥ 0   for k = 1, 2, ..., K + 1        (5.41)

Unlike DP mixture models, this problem does not have a closed-form solution. The objective in Eq. (5.40) is non-convex and has many local optima. However, we can apply a block coordinate ascent algorithm to iteratively estimate the best r̂ given fixed θ̂, then the best θ̂ given fixed r̂, and so on until convergence.

Local substep for assignment responsibilities

Fixing θ̂_d, the objective for token t in document d reduces to an objective similar to the per-token objective L for DP mixtures in Eq. (3.46):

    L_dt(x_dt, r̂_dt, τ̂, ν̂, θ̂) ≜ Σ_{k=1}^K r̂_dtk ( W_dtk(x_dt, τ̂, ν̂, θ̂) − log r̂_dtk )        (5.42)

    W_dtk(x_dt, τ̂, ν̂, θ̂) ≜ E_{q(φ_k)}[log p(x_dt | φ_k)] + E_{q(π_d)}[log π_dk]        (5.43)

Just as with the DP mixture problem, we can use the RespFromWeights procedure to find the optimum of Eq. (5.42):

    r̂_dtk* = e^{W_dtk} / Σ_{ℓ=1}^K e^{W_dtℓ}        (5.44)

Local substep for doc-topic pseudo-counts

Next, we consider finding the optimal θ̂_d for document d, given the requisite constraints and fixed assignments r̂_d. Using Lagrange multiplier methods applied to the objective term L^HDP-doc, we find a closed-form optimum.
For each active cluster k ≤ K, we have:

    θ̂_dk* = N_dk(r̂_d) + απ_k^G(û)        (5.45)

while for the aggregate index >K representing all inactive clusters, we have

    θ̂_{d>K}* = απ_{>K}^G(û)        (5.46)

Algorithm 5.1 Algorithm for local step under an HDP topic model
Input:
    α : positive real, document-topic smoothing scalar
    {π_k^G(û)}_{k=1}^K : expected probability of active topic k under q(u | û, ω̂)
    π_{>K}^G(û) : expected probability of aggregate inactive topics
    {{C_dtk}_{k=1}^K}_{t=1}^{T_d} : expected log probability of data atom x_dt under topic k, C_dtk ≜ E_q[log p(x_dt | φ_k)]
Output:
    r̂_d : responsibilities for doc d
    [θ̂_d1 ... θ̂_dK θ̂_{d>K}] : topic pseudo-counts for doc d
1:  function LocalStepForDoc(C_d, α, û)
2:    for t = 1, ..., T_d do                  ⊲ Initialize responsibilities r̂_d
3:      for k = 1, ..., K do
4:        W_dtk ← C_dtk + log π_k^G(û)
5:      r̂_dt ← RespFromWeights(W_dt)
6:    while not converged do                  ⊲ Iterate between updating θ̂ and r̂
7:      for k = 1, ..., K do
8:        N_dk ← Σ_{t=1}^{T_d} r̂_dtk
9:        θ̂_dk ← N_dk + απ_k^G(û)
10:       P_dk ← ψ(θ̂_dk)                     ⊲ Log prior probability E[log π_dk]
11:     for t = 1, ..., T_d do
12:       for k = 1, ..., K do
13:         W_dtk ← C_dtk + P_dk
14:       r̂_dt ← RespFromWeights(W_dt)
15:   θ̂_{d>K} ← απ_{>K}^G(û)                 ⊲ Set inactive term
16:   return r̂_d, θ̂_d

This algorithm solves the optimization problem defined in Eq. (5.40), producing optimal document-specific responsibilities r̂_d defining the assignment posterior q(z_d | r̂_d) and pseudo-counts θ̂_d defining the topic frequency posterior q(π_d | θ̂_d).

Figure 5.4: Illustration of restart proposals for HDP topic model local step. Sparsity-promoting restarts for local update steps on the Science corpus with K = 100.
Left: Example fixed points of the document-topic count summary statistic N_dk for a single document in the Science corpus. We show only select topic indices out of all K = 100. Center: Trace of a single document's objective L during local step inference for 50 random initializations (dashed lines). The solid lines show one run with sparsity-promoting moves enabled. This run climbs through the color-coded fixed points in the left plot. Right: Trace plot of the whole-dataset objective L across many passes through the whole Science corpus. Using sparsity-promoting restarts yields noticeable improvements in model quality.

Iterative algorithm for joint local step update

To find an ideal joint configuration r̂_d*, θ̂_d*, we iterate between the updates above until convergence. An algorithm describing the required iterations is given in Alg. 5.1. This algorithm includes our recommended heuristic initialization (discussed below) and delivers a final value for the responsibilities r̂_d*. From this converged value, we can easily compute the required θ̂_d*.

Local step initialization. To initialize the update cycle for a document, we recommend visiting each token and updating it with initial weight W_dtk = E_q[log p(x_dt | φ_k)] + log π_k^G(û). This initialization lets the likelihood drive the initial assignments while still incorporating the current best estimate of the global topic probabilities π^G(û). We then alternate updates to r̂ and θ̂ until either a maximum number of iterations is reached (typically 100) or the maximum change of all document-topic counts N_dk falls below a threshold (typically 0.05).

5.3.4 Sparse restart proposals for local step

When visiting document d, the joint inference of θ̂ and r̂ can be challenging due to the non-convexity of the joint inference problem in Eq. (5.40). Many local optima exist even for this single-document task, as shown in Fig. 5.4. A common failure mode occurs when a few tokens are assigned to a rare "junk" topic.
Reassignment of these tokens may not happen during the standard coordinate ascent updates due to a valley in the objective between keeping the current junk assignments and setting the junk topic to zero. To more adequately escape local optima, we develop sparsity-promoting restart moves which take a final document-topic count vector [N_d1 ... N_dK] produced by coordinate ascent, propose an alternative which has one entry set to zero, and accept if this improves the ELBO after further ascent steps. In practice, the acceptance rate varies from 30-50% when trying the 25 smallest non-zero topics. We observe huge gains in the whole-dataset objective due to these restarts, as shown in Fig. 5.4, while without them we sometimes even observe decreases in the overall objective on repeat visits to the same document.

Algorithm 5.2 Algorithm for restart proposals used in local step of inference for HDP topic model
Input:
    α : document-topic smoothing scalar
    {{C_dtk}_{k=1}^K}_{t=1}^{T_d} : expected log probability of token d, t under topic k, C_dtk ≜ E_q[log p(x_dt | φ_k)]
    r̂_d : initial responsibilities for doc d
    [θ̂_d1 ... θ̂_dK θ̂_{d>K}] : initial topic pseudo-counts for doc d
Output:
    r̂_d : responsibilities for doc d
    [θ̂_d1 ... θ̂_dK θ̂_{d>K}] : topic pseudo-counts for doc d
1:  function RestartProposalsForDoc(r̂_d, θ̂_d, C_d, α, û)
2:    for k = 1, ..., K do
3:      N_dk ← Σ_{t=1}^{T_d} r̂_dtk            ⊲ Initial usage counts
4:    A_d ← {k : N_dk > 0.1}
5:    for j ∈ A_d do
6:      r̂_d′ ← Copy(r̂_d)
7:      for t = 1, ..., T_d do
8:        r̂_dtj′ ← 0                           ⊲ Propose sparser q(π_d) by setting count to zero
9:      while not converged do
10:       for k = 1, ..., K do
11:         N_dk′ ← Σ_{t=1}^{T_d} r̂_dtk′
12:         θ̂_dk′ ← N_dk′ + απ_k^G(û)
13:         P_dk′ ← ψ(θ̂_dk′)
14:       for t = 1, ..., T_d do
15:         for k = 1, ..., K do
16:           W_dtk′ ← C_dtk + P_dk′
17:         r̂_dt′ ← RespFromWeights(W_dt′)
18:     if L_d(r̂_d′, θ̂_d′) > L_d(r̂_d, θ̂_d) then  ⊲ Keep proposal if document score improves
19:       r̂_d ← r̂_d′
20:       θ̂_d ← θ̂_d′
21:   return r̂_d, θ̂_d

Restart proposals attempt to find better local optima of the variational optimization objective function in Eq. (5.40). Based on the heuristic that the document-topic prior prefers sparsity, this procedure repeatedly proposes alternative local parameters where one currently used topic has its usage forced to zero. This procedure is guaranteed to monotonically improve the document-specific objective function.

5.3.5 Specialization to bag-of-words datasets

Several aspects of the algorithm can be simplified or clarified for the popular use-case of training on discrete bag-of-words datasets. Let each document x_d consist of observed word tokens from a fixed vocabulary of V word types. We can represent x_d in two ways. First, as a dense list of the T_d word tokens in document d: x_d = {x_dt}_{t=1}^{T_d}. Each value x_dt ∈ {1, ..., V} identifies the type of the t-th word in the document. Second, we can use a memory-saving sparse histogram representation: x_d = {v_du, c_du}_{u=1}^{U_d}, where u indexes the set of word types that appear at least once in the document, v_du ∈ {1, ..., V} gives the integer id of word type u, and c_du ≥ 1 is the count of word type v_du in document d. Naturally, Σ_{u=1}^{U_d} c_du = T_d.

Sharing parameters by word type. Naively, tracking the assignments for document d requires explicitly representing a separate K-dimensional distribution q(z_dt) for each of the T_d tokens. However, we can save memory and runtime by recognizing that, given a specified document-topic approximate posterior q(π_d | θ̂_d), the corresponding update for the assignment posterior q(z_dt) will have shared structure for any tokens of the same type. That is, if there exist two token indices s and t in document d such that x_ds = v and x_dt = v, then by definition after a local update their responsibilities will be exactly the same: r̂_ds = r̂_dt. We can thus share parameters with no loss in representational power.
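Putting the two preceding sections together, here is a sketch of the local step of Alg. 5.1 specialized to bag-of-words data, keeping one responsibility vector per distinct word type and letting the counts c_du weight the document-topic statistics. All function and variable names are our own (this is not bnpy's API), and convergence is checked on the document-topic counts as recommended above:

```python
import numpy as np
from scipy.special import digamma
from collections import Counter

def local_step_bow(token_ids, loglik_by_type, alpha, pi_active, pi_gtK,
                   max_iters=100, tol=0.05):
    """Coordinate ascent for one bag-of-words document.

    token_ids      : dense token list x_d (length T_d) of word-type ids
    loglik_by_type : dict mapping word type v -> length-K array of
                     E_q[log p(v | phi_k)]
    pi_active      : (K,) expected active-topic probabilities under q(u)
    pi_gtK         : scalar aggregate inactive-topic probability
    Returns one responsibility row per distinct type (U_d, K) and the
    pseudo-count vector [theta_d1 ... theta_dK theta_d>K]."""
    hist = sorted(Counter(token_ids).items())           # {(v_du, c_du)}
    counts = np.array([c for _, c in hist], dtype=float)
    C = np.stack([loglik_by_type[v] for v, _ in hist])  # (U_d, K)

    def resp_from_weights(W):
        W = W - W.max(axis=1, keepdims=True)            # stable softmax, Eq. (5.44)
        e = np.exp(W)
        return e / e.sum(axis=1, keepdims=True)

    resp = resp_from_weights(C + np.log(pi_active))     # likelihood-driven init
    N = counts @ resp                                   # N_dk = sum_u c_du r_duk
    for _ in range(max_iters):
        theta = N + alpha * pi_active                   # Eq. (5.45)
        resp = resp_from_weights(C + digamma(theta))
        N_new = counts @ resp
        if np.abs(N_new - N).max() < tol:               # converged doc-topic counts
            N = N_new
            break
        N = N_new
    return resp, np.append(N + alpha * pi_active, alpha * pi_gtK)
```

Note that the dense (T_d, K) responsibility table never appears: a document with heavy word burstiness touches only U_d rows, which is exactly the savings the text describes.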
Instead of T_d total responsibility vectors, we need only U_d distinct vectors, one for each unique vocabulary word appearing at least once in the document. This leads to substantial savings in required storage as well as required computation, because frequently U_d ≪ T_d for real text data due to word burstiness.

5.4 Algorithms for HDP topic model posterior estimation

5.4.1 Full-dataset variational

When the whole dataset can fit in memory, a complete algorithm for estimating the global parameters of the HDP topic model from data is given in Alg. 5.3. The algorithm alternates between the local step for each document (a subprocedure specified in Alg. 5.1) and the global step, which updates both the allocation parameters û, ω̂ and observation parameters τ̂, ν̂. Generally, this procedure is run to convergence at a fixed-point local optimum. One useful criterion for judging convergence is to monitor the assignment counts N_k for each active cluster over many iterations. Convergence can be measured by checking whether the maximum absolute change in any entry of this vector drops below a small threshold.

Algorithm 5.3 Variational coordinate ascent for HDP topic model
Input:
    {x_d}_{d=1}^D : dataset with D documents
    K : truncation level
    {τ̂_k, ν̂_k}_{k=1}^K : initial global parameters of observation model
    {û_k, ω̂_k}_{k=1}^K : initial global parameters of allocation model
    γ, α : allocation model hyperparameters
    τ̄, ν̄ : observation model prior hyperparameters
Output:
    {τ̂_k, ν̂_k}_{k=1}^K : updated global parameters of observation model
    {û_k, ω̂_k}_{k=1}^K : updated global parameters of allocation model
1:  function VariationalCoordAscentForHDPTopic(x, K, τ̂, ν̂, û, ω̂)
2:    while not converged do
3:      for d ∈ 1, 2, ..., D do               ⊲ Local step at document d
4:        for t ∈ 1, 2, ..., T_d do
5:          for k ∈ 1, 2, ..., K do
6:            C_dtk ← E_q[log p(x_dt | φ_k)]
7:        r̂_d, θ̂_d ← LocalStepForDoc(C_d, α, û)
8:        r̂_d, θ̂_d ← RestartProposalsForDoc(r̂_d, θ̂_d, C_d, α, û)
9:      for k ∈ 1, 2, ..., K do               ⊲ Summary step
10:       S_k ← Σ_{d=1}^D Σ_{t=1}^{T_d} r̂_dtk s(x_dt)
11:       N_k ← Σ_{d=1}^D Σ_{t=1}^{T_d} r̂_dtk
12:       P_k ← Σ_{d=1}^D [ ψ(θ̂_dk) − ψ(Σ_{ℓ=1}^{K+1} θ̂_dℓ) ]
13:     P_{>K} ← Σ_{d=1}^D [ ψ(θ̂_{d>K}) − ψ(Σ_{ℓ=1}^{K+1} θ̂_dℓ) ]
14:     for k ∈ 1, 2, ..., K do               ⊲ Global step for observation parameters
15:       τ̂_k ← S_k + τ̄
16:       ν̂_k ← N_k + ν̄
17:     û ← arg max_û f(û, ω̂, P(θ̂), α, γ)     ⊲ Global step for allocation parameters, via L-BFGS
18:   return τ̂, ν̂, û, ω̂

Full-dataset algorithm for approximate posterior inference for the HDP topic model in Fig. 5.1. For the LocalStepForDoc procedure, see Alg. 5.1.

5.4.2 Stochastic variational

The application of stochastic variational inference (Hoffman et al., 2013) to our direct assignment HDP topic model optimization problem in Eq. (5.29) is almost straightforward. The natural gradient update for the observation model parameters τ̂, ν̂ applies without change from the DP mixture model case due to conditional conjugacy. However, the non-conjugate relationship between q(u) and q(π_d) does not allow the direct application of the natural gradient update. Nevertheless, we develop a modified stochastic algorithm given in Alg. 5.4.

5.4.3 Memoized variational

A memoized algorithm for the HDP topic model variational optimization problem is given in Alg. 5.5.

Algorithm 5.4 Stochastic variational coordinate ascent for HDP topic model
Input:
    {x_d}_{d=1}^D : dataset with D documents
    K : truncation level
    {τ̂_k, ν̂_k}_{k=1}^K : initial global parameters of observation model
    {û_k, ω̂_k}_{k=1}^K : initial global parameters of allocation model
    γ, α : allocation model hyperparameters
    τ̄, ν̄ : observation model prior hyperparameters
Output:
    {τ̂_k, ν̂_k}_{k=1}^K : updated global parameters of observation model
    {û_k, ω̂_k}_{k=1}^K : updated global parameters of allocation model
1:  function StochasticVariationalForHDPTopic(x, K, τ̂, ν̂, û, ω̂)
2:    for iteration i ∈ 1, 2, ... do
3:      D_i ← SampleWithoutReplacement({1, 2, ..., D}, D/B)
4:      for d ∈ D_i do                        ⊲ Local step at current batch
5:        for t ∈ 1, 2, ..., T_d do
6:          for k ∈ 1, 2, ..., K do
7:            C_dtk ← E_q[log p(x_dt | φ_k)]
8:        r̂_d, θ̂_d ← LocalStepForDoc(C_d, α, û)
9:        r̂_d, θ̂_d ← RestartProposalsForDoc(r̂_d, θ̂_d, C_d, α, û)
10:     for k ∈ 1, 2, ..., K do               ⊲ Summary step at current batch
11:       S_k ← Σ_{d∈D_i} Σ_{t=1}^{T_d} r̂_dtk s(x_dt)
12:       N_k ← Σ_{d∈D_i} Σ_{t=1}^{T_d} r̂_dtk
13:       P_k ← Σ_{d∈D_i} [ ψ(θ̂_dk) − ψ(Σ_{ℓ=1}^{K+1} θ̂_dℓ) ]
14:     P_{>K} ← Σ_{d∈D_i} [ ψ(θ̂_{d>K}) − ψ(Σ_{ℓ=1}^{K+1} θ̂_dℓ) ]
15:     ξ_i ← (δ + i)^{−κ}                    ⊲ Update learning rate
16:     for k ∈ 1, 2, ..., K do               ⊲ Global step for observation parameters
17:       τ̂_k ← (1 − ξ_i) τ̂_k + ξ_i ( τ̄ + (D/|D_i|) S_k )
18:       ν̂_k ← (1 − ξ_i) ν̂_k + ξ_i ( ν̄ + (D/|D_i|) N_k )
19:     û ← arg max_û f(û, ω̂, (D/|D_i|) P(θ̂), α, γ)   ⊲ Global step for allocation parameters
20:   return τ̂, ν̂, û, ω̂

Stochastic variational algorithm for approximate posterior inference for the HDP topic model in Fig. 5.1. For the LocalStepForDoc procedure, see Alg. 5.1.

Algorithm 5.5 Memoized variational coordinate ascent for HDP topic model
Input:
    {x_d}_{d=1}^D : dataset with D documents
    K : truncation level
    {τ̂_k, ν̂_k}_{k=1}^K : initial global parameters of observation model
    {û_k, ω̂_k}_{k=1}^K : initial global parameters of allocation model
    γ, α : allocation model hyperparameters
    τ̄, ν̄ : observation model prior hyperparameters
Output:
    {τ̂_k, ν̂_k}_{k=1}^K : updated global parameters of observation model
    {û_k, ω̂_k}_{k=1}^K : updated global parameters of allocation model
1:  function MemoizedVariationalForHDPTopic(x, K, τ̂, ν̂, û, ω̂)
2:    S^G, N^G, P^G ← 0                       ⊲ Initialize global statistics
3:    for batch b ∈ 1, 2, ..., B do
4:      S_b, N_b, P_b ← 0                     ⊲ Initialize batch statistics
5:    for lap ℓ ∈ 1, 2, ... do
6:      for batch b ∈ Shuffle({1, 2, ..., B}) do
7:        for doc d ∈ D_b do                  ⊲ Local step at current batch
8:          for t ∈ 1, 2, ..., T_d do
9:            for k ∈ 1, 2, ..., K do
10:             C_dtk ← E_q[log p(x_dt | φ_k)]
11:         r̂_d, θ̂_d ← LocalStepForDoc(C_d, α, û)
12:         r̂_d, θ̂_d ← RestartProposalsForDoc(r̂_d, θ̂_d, C_d, α, û)
13:       S^G ← S^G − S_b                     ⊲ Decrement previous batch statistics
14:       N^G ← N^G − N_b
15:       P^G ← P^G − P_b
16:       for k ∈ 1, 2, ..., K do             ⊲ Summary step at current batch
17:         S_bk ← Σ_{d∈D_b} Σ_{t=1}^{T_d} r̂_dtk s(x_dt)
18:         N_bk ← Σ_{d∈D_b} Σ_{t=1}^{T_d} r̂_dtk
19:         P_bk ← Σ_{d∈D_b} [ ψ(θ̂_dk) − ψ(Σ_{ℓ=1}^{K+1} θ̂_dℓ) ]
20:       P_{b>K} ← Σ_{d∈D_b} [ ψ(θ̂_{d>K}) − ψ(Σ_{ℓ=1}^{K+1} θ̂_dℓ) ]
21:       S^G ← S^G + S_b                     ⊲ Increment new batch statistics
22:       N^G ← N^G + N_b
23:       P^G ← P^G + P_b
24:       for k ∈ 1, 2, ..., K do             ⊲ Global step for observation parameters
25:         τ̂_k ← τ̄ + S_k^G
26:         ν̂_k ← ν̄ + N_k^G
27:       û ← arg max_û f(û, ω̂, P^G, α, γ)    ⊲ Global step for allocation parameters, via L-BFGS
28:   return τ̂, ν̂, û, ω̂

Memoized algorithm for approximate posterior inference for the HDP topic model in Fig. 5.1. For the LocalStepForDoc procedure, see Alg. 5.1.

Memoized algorithm and monotonicity guarantee

In general, the objective function L is guaranteed to monotonically increase after any global step, but not after the local step. This occurs because the local step requires jointly estimating the assignment responsibilities r̂_d and document-topic pseudo-counts θ̂_d from scratch when visiting a document d. Because this local step problem is non-convex even given fixed global parameters, we cannot guarantee that on subsequent visits to a document d we will always improve the objective. Instead, it may be the case that even after restart proposals, the fixed point r̂_d, θ̂_d leads to a drop in the L score. We find that this is rare, but it can happen. Without a procedure that always delivers a guaranteed global optimum of the local step problem, the only way to guarantee monotonic increase of L is to warm-start the local step iterations at a value of r̂_d or θ̂_d stored from a previous visit to the document d. However, this requires tremendous memory.
For example, we would need to store the value of θ̂_dk for all K active topics and D documents. With thousands of topics and millions of documents, this becomes infeasible.

Memoized evaluation of the objective

As in DP mixture models, after completing the first epoch or lap of memoized inference in Alg. 5.5, the global statistics N^G, S^G, P^G accurately represent every document in the dataset, and thus we can compute the score of the whole objective L at any point beyond lap ℓ = 1. We do need several auxiliary statistics, defined below.

Entropy statistic for ELBO computation. Like in DP mixture models, we require tracking the assignment entropy of each active topic in each batch. Let H_bk define the sum of assignment entropies within batch b for topic k. Then, given the whole-dataset aggregated entropy values H_k^G = Σ_b H_bk, we can easily evaluate L^entropy from Eq. (5.28) as:

    L^entropy(r̂) = Σ_{k=1}^K H_k^G(r̂),   H_k^G = Σ_{b=1}^B H_bk,   H_bk(r̂) ≜ − Σ_{d∈D_b} Σ_t r̂_dtk log r̂_dtk.        (5.47)

Document-specific statistics for ELBO computation. The document-specific term of the allocation model L^HDP-doc in Eq. (5.26) can be rewritten using the statistic vector Q defined below, where θ̂_d· ≜ Σ_{ℓ=1}^{K+1} θ̂_dℓ; expanding −c_Dir(θ̂_d) in Eq. (5.27) fixes the signs of the log-gamma terms:

    L^HDP-doc(r̂, θ̂, û) = Q_0^G(θ̂) + Q_{>K}^G(θ̂) + Σ_{k=1}^K Q_k^G(r̂, θ̂) + Σ_{k=1}^K απ_k^G(û) P_k(θ̂) + απ_{>K}^G(û) P_{>K}(θ̂)        (5.48)

    Q_0^G ≜ Σ_b Q_b0,   Q_b0 = − Σ_{d∈D_b} log Γ(θ̂_d·)
    ∀k ∈ {1, 2, ..., K}:   Q_k^G ≜ Σ_b Q_bk,   Q_bk = Σ_{d∈D_b} [ log Γ(θ̂_dk) + (N_dk(r̂) − θ̂_dk)(ψ(θ̂_dk) − ψ(θ̂_d·)) ]
    Q_{>K}^G ≜ Σ_b Q_b>K,   Q_b>K = Σ_{d∈D_b} [ log Γ(θ̂_{d>K}) − θ̂_{d>K}(ψ(θ̂_{d>K}) − ψ(θ̂_d·)) ]

Thus, with modest storage for Q_b (a K + 2 dimensional vector) and H_b (a K dimensional vector) required at each batch b, we can compute the whole-dataset objective L exactly despite processing data one batch at a time.

5.5 Variational algorithms with proposal moves that adapt the number of clusters

We now develop a framework for merge, birth, and delete proposal moves for the HDP topic model, building on our earlier efforts for the DP mixture model.
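Before turning to the proposal moves, the per-batch bookkeeping just described is worth making concrete. The sketch below (our own helper names, assuming numpy/scipy; the Q signs follow the expansion of Eq. (5.27)) computes the entropy statistic H_b and the K+2 dimensional Q_b vector from one batch's local parameters:

```python
import numpy as np
from scipy.special import digamma, gammaln

def batch_entropy(resp_list):
    """H_bk of Eq. (5.47): per-topic assignment entropy summed over all
    documents in one batch; resp_list holds each doc's (T_d, K) resp."""
    H = np.zeros(resp_list[0].shape[1])
    for resp in resp_list:
        with np.errstate(divide="ignore", invalid="ignore"):
            H -= np.where(resp > 0, resp * np.log(resp), 0.0).sum(axis=0)
    return H

def batch_Q(theta, N):
    """Q_b of Eq. (5.48) as a K+2 vector [Q_b0, Q_b1 ... Q_bK, Q_b>K].

    theta : (D_b, K+1) pseudo-counts, inactive entry last
    N     : (D_b, K) document-topic counts N_dk."""
    tsum = theta.sum(axis=1)                              # theta_{d.}
    Elogpi = digamma(theta) - digamma(tsum)[:, None]      # E_q[log pi_d]
    Q0 = -gammaln(tsum).sum()
    Qk = (gammaln(theta[:, :-1])
          + (N - theta[:, :-1]) * Elogpi[:, :-1]).sum(axis=0)
    QgtK = (gammaln(theta[:, -1]) - theta[:, -1] * Elogpi[:, -1]).sum()
    return np.concatenate(([Q0], Qk, [QgtK]))
```

Storing one such H_b and Q_b per batch, plus the additive S_b, N_b, P_b, is exactly the modest footprint the text claims: decrementing the stale copy and incrementing the fresh one keeps the global totals in sync with every document.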
Previous efforts for MCMC sampling inference using transition operators that explicitly modify the number of clusters include the split-merge proposal methods of Wang and Blei (2012b), as well as the more recent split-merge sampler of Chang and Fisher III (2014). Among previous methods that adaptively add or remove clusters within a variational framework, two lines of work stand out. First, Wang and Blei (2012a) used a Gibbs sampler inner loop for the local step, which can add or remove clusters but will generally be slower to make large changes than well-designed explicit proposals. Second, Bryant and Sudderth (2012) offer an approach similar to ours: explicit split and merge proposals which are either accepted or rejected based on improvements to the optimization problem's objective function. However, by using stochastic variational inference, they cannot make accept or reject decisions which are consistent with the entire dataset. Our approach can simultaneously make bigger changes than the Gibbs-within-variational inference methods while offering guarantees of improved model quality not possible with stochastic methods.

5.5.1 Merge proposals

Each merge proposal transforms a current variational posterior r̂, θ̂, û, ω̂, τ̂, ν̂ into a candidate with one fewer cluster by combining two chosen topics kA, kB into a single merged topic. These moves can eliminate redundant topics and improve the final model's interpretability. Accepted merge proposals also make subsequent iterations of our algorithm faster by reducing the number of active topics, since all steps of inference scale linearly with K.

Merge proposal construction

Just like with DP mixture models, given a pair of clusters kA, kB to merge, we create the candidate state in two steps. First, we create the local parameters r̂′, θ̂′ and their accompanying sufficient statistics.
Second, we instantiate values for the candidate global parameters, which are optimal under our objective given the proposed local parameters. A fully-constructed proposal can then be checked by the objective L to decide acceptance or rejection.

Figure 5.5: Practical examples of merges and deletes on topic models. Top Left: Anchor topics (Arora et al., 2013) can be improved significantly by variational updates. Top Right: Topic pairs accepted by merge moves during a run on Wikipedia. Combining each pair into one topic improves our objective L, saves space, and removes redundancy. Bottom Row: Accepted delete move during a run on Wikipedia. The red topic is rarely used and lacks semantic focus. Removing it and reassigning its mass to the remaining topics, in document-specific fashion, improves L and interpretability.

Local construction. We use the deterministic addition in Eq. (4.1) for creating r̂′ from r̂. Similarly, we have a deterministic addition for creating θ̂′ from θ̂. For each document d and each k = 1, 2, ..., K − 1:

    θ̂_dk′ = θ̂_dkA + θ̂_dkB    if k = kA        (5.49)
    θ̂_dk′ = θ̂_dk             else if k < kB
    θ̂_dk′ = θ̂_d(k+1)         else if k ≥ kB

We also keep the pseudo-count value for the inactive topics θ̂_{d>K} the same.

Summary statistic construction. As in DP mixture models, the statistics N_k(r̂′) and S_k(r̂′, x) can be constructed easily given their original values, as in Eq. (4.2). We also must compute the entropy of the assignment posterior H(r̂′), which again is the same computation as for the DP mixture model. The summary statistics of θ̂, such as the aggregate log probabilities P(θ̂) used for the global step and the terms Q(θ̂) needed for the evaluation of L^HDP-doc, need to be computed specifically for each merge pair kA, kB, because these are non-linear functions of θ̂.

Global construction.
As in DP mixtures, given the summary statistics for the merged local parameters, we can create proposed global parameters simply by executing the appropriate global parameter optimization steps.

Selecting candidate cluster pairs to try merging

Past work (Bryant and Sudderth, 2012; Hughes et al., 2015a) suggests that one viable way to select candidate pairs is to use the empirical correlation of two candidate topics across all document-specific count vectors {N_d}_{d=1}^D:

    score(kA, kB) = Corr(N_{:kA}, N_{:kB}),   −1 < score < 1.        (5.50)

Large scores identify topic pairs frequently used in the same documents, which may be a useful signal for potential merges. While useful, we now recommend a version of the selection score used for DP mixture models, which uses information about the change in L^data under a proposal to suggest which pairs to track.

5.5.2 Delete proposals

Delete moves provide a more powerful alternative to merges for removing rarely used "junk" topics. For an illustration of an accepted delete proposal in the context of training a topic model on Wikipedia data, see Fig. 5.5. After identifying a candidate topic with small mass to delete, we reassign all its tokens to the remaining topics. This move can succeed where a merge would fail, because each document's tokens can be reassigned in a customized, document-specific fashion, as shown in Fig. 5.5. Our first work on deletes for topic models was published in an AISTATS '15 conference paper (Hughes et al., 2015a). We reproduce the ideas from that paper on delete proposals here. First, we outline how a delete move would work if we could afford explicitly updating all documents in the dataset. Next, we describe how deletes work in our memoized framework, where we use a heuristic delete proposal, in the sense that it does not construct valid parameters r̂′, θ̂′ representing the full dataset or evaluate a whole-dataset objective function. Nevertheless, we find decent performance in practice.
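Before turning to the delete construction, the merge machinery of Sec. 5.5.1 can be sketched in a few lines. These are our own minimal helpers (not bnpy's API): one ranks candidate pairs by the correlation score of Eq. (5.50), the other builds a document's merged pseudo-counts per Eq. (5.49):

```python
import numpy as np

def merge_score_pairs(N):
    """Eq. (5.50): rank candidate merge pairs by the empirical correlation
    of topic-usage counts across documents. N is a (D, K) matrix of N_dk.
    Returns (score, kA, kB) tuples, most correlated pairs first."""
    corr = np.corrcoef(N, rowvar=False)          # columns are topics
    K = corr.shape[1]
    pairs = [(corr[a, b], a, b) for a in range(K) for b in range(a + 1, K)]
    return sorted(pairs, reverse=True)

def merge_theta(theta_d, kA, kB):
    """Eq. (5.49): a document's pseudo-counts after merging topic kB into
    kA (requires kA < kB). The trailing inactive entry is unchanged."""
    out = np.delete(theta_d, kB)
    out[kA] += theta_d[kB]
    return out

# Topics 0 and 1 always co-occur, so they top the candidate list.
N = np.array([[1.0, 2.0, 0.0], [2.0, 4.0, 5.0], [3.0, 6.0, 1.0]])
score, kA, kB = merge_score_pairs(N)[0]
theta = np.array([4.0, 1.0, 2.5, 0.5, 0.3])      # K = 4 active + inactive entry
assert np.allclose(merge_theta(theta, 0, 2), [6.5, 1.0, 0.5, 0.3])
```

Note that only this additive local construction is cheap; as the text stresses, the non-linear P(θ̂) and Q(θ̂) statistics must be recomputed for each candidate pair before the objective can render an accept/reject verdict.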
Whole-dataset delete construction

Delete moves remove some topic, indexed by j, from a current set of parameters and sufficient statistics of size K + 1. For simplicity, this explanation assumes j is last in index order, but in fact j can be at any position. The move constructs new parameters and new sufficient statistics S′, N′ of size K, where any mass assigned to topic j has been reallocated among the other topics. Here, unlike the merge move, we have no one-step rule for constructing the candidate local parameters. Instead, we use a heuristic initialization followed by refinement coordinate ascent updates. We initialize sufficient statistics by simply removing any entries associated with topic j:

Original:        N  = [N_1 . . . N_K  N_j]   (5.51)
Candidate init:  N′ = [N_1 . . . N_K]

Given these initial summaries, we take a global step to create K candidate global parameters. The main paper’s Fig. 1 reviews the big picture for how summary statistics lead to global parameter updates. After creating the candidate global parameters, we realize that τ̂′ will have exactly the same first K topics as the original model. For ρ̂′, ω̂′, the resulting E[β] will be similar, too. Next, a local step reassigns all tokens (including those formerly assigned to topic j) among the K remaining topics. After one more global step, we have a viable candidate model q′ representing the whole dataset. This model can be compared to the original, and kept if the objective improves.

Memoized delete construction

For large datasets, it is infeasible to perform several local/global update cycles for all documents just to evaluate one candidate move. A more scalable delete move is possible because we assume junk topic j has only a small subset of documents with appreciable mass, while most documents assign N_dj ≈ 0. Thus, only the small set satisfying N_dj > ε needs to be edited explicitly, where we set ε = 0.01. The memoized delete move happens in three steps.
First, we gather all documents satisfying this threshold test into a target dataset during a standard pass of the dataset. Second, we construct the delete candidate model q′ from the target set, performing the simple construction described above while holding the non-target sufficient statistics fixed. That is, for each additive sufficient statistic vector N, S, T in the previous model, we create candidates N′, S′, T′ that satisfy the following relation:

N′_k = N_k − N^before_k + N^after_k,   k ∈ {1 . . . K}   (5.52)

Here, N^before_k is the statistic for topic k on the target set before removing j, and N^after_k is the computed statistic on the target set after removing j and performing the several updates. For the specific case of the token count statistic N′ on the target set, we know that

N_ε + Σ_{k=1}^K N′_k = N_j + Σ_{k=1}^K N_k,

where N_ε represents the small mass assigned to j from documents that did not pass the threshold test. If accepted, sufficient statistic vector N′ will soon accurately reflect all data (including the small discarded mass) after a complete pass of local and global steps at all batches. To determine acceptance, we evaluate the objective L(·) using candidate global parameters û′, ω̂′, τ̂′, ν̂′, which are obtained via direct updates from N′, S′, P′. For the local arguments to L(·), we use the inferred parameters r̂_d, θ̂_d from documents in the target set. If the candidate model improves this objective, we accept it. After accepting, we need to adjust all stored batch-specific summaries to reflect the new model. Otherwise, our new aggregate summaries will not be consistent with the sum of stored batch summaries, and subsequent incremental updates will be invalid. We thus edit the stored statistics for each batch to reflect the final state of the target-set documents from that batch. Immediately after a delete move, we do not have the required ELBO summaries to exactly compute the bound after visiting the next batch.
However, after completing a complete lap through all batches, the relevant summaries will be refreshed and the ELBO computable.

Selecting topics to delete

Delete move costs scale with the number of documents in the target set. We specify a maximum cap of 500 on the total documents we can afford to process as a target set. Any topic occurring in fewer than 500 documents is eligible for deletion. We select as many topics as possible from this eligible set until the total cap is reached, and build the target set as the union of all documents passing the threshold test for any selected topic. This allows potentially multiple topics to be deleted in one pass through the data, each one considered independently, while never exceeding the specified cap on target set size.

5.5.3 Birth proposals

HDP topic models have two sets of local parameters that accompany every proposal: assignments r̂′, which have one vector per data atom, and probability pseudo-counts θ̂′, which have one vector per document d. As in DP mixtures, we can easily create coherent summaries N(r̂), S(r̂) of assignments across batches with different truncation levels by inserting zeros. However, each batch requires a summary P(θ̂), which is a non-linear function of the pseudo-count local parameters:

P_dk(θ̂) = E_q[log π_dk] = ψ(θ̂_dk) − ψ(θ̂_d·),   P_bk = Σ_{d∈D_b} P_dk   (5.53)

Each entry P_bk can be interpreted as an aggregated log probability. If topic k is popular across batch b, the value of P_bk will be a small-magnitude negative number, while as topic k becomes more rare, the value of P_bk will grow toward negative infinity. To build intuition for birth proposals with an HDP allocation model, we consider an example for the d-th document. Below, we show possible before and after values for four interrelated variational parameter vectors: the expected global topic probabilities π^G, aggregated assignment counts N(r̂_d), variational parameters θ̂_d, and log probabilities P_d.
We assume the original model has 3 active clusters, and the proposal splits the third cluster into two new active clusters, which are placed last in stick-breaking order.

BEFORE:  π^G     = [ 0.4,  0.3,  0.2], 0.1   (5.54)
         N(r̂_d)  = [10.0, 50.0,  5.0]
         θ̂_d     = [10.4, 50.3,  5.2], 0.1
         P(θ̂_d)  = [−1.889, −0.274, −2.633]
         U(θ̂_d)  = [ 0.000,  0.000, −14.606]

AFTER:   π′^G    = [ 0.4,  0.3,  0.02, 0.09, 0.09], 0.1   (5.55)
         N(r̂′_d) = [10.0, 50.0,  0.00, 3.00, 2.00]
         θ̂′_d    = [10.4, 50.3,  0.02, 3.09, 2.09], 0.1
         P(θ̂′_d) = [−1.889, −0.274, −54.727, −3.224, −3.703]
         U(θ̂′_d) = [ 0.000,  0.000,  0.000,  0.000, −14.606]

The first step of constructing this proposal is creating a temporary value for probability vector π^G. Our current variational state defines an approximate posterior q(u) which implies a distribution on π^G. We begin by setting π^G to its expected value under this posterior. This gives us a concrete vector of size K, with one entry for each of the active clusters. Remember that we also have one extra entry that aggregates the mass of all inactive clusters. Next, we create an expanded vector π′^G of size K + J′ by copying π^G but redistributing the original target cluster mass uniformly among the new J′ clusters. The proposed value for π′^G leaves only a small fraction ε of mass at the target cluster, and keeps the inactive mass unchanged. Next, we provide this specific π′^G as input to the Bregman k-means++ and restricted local step process, which delivers specific values for assignments r̂′_d at each data unit and their associated per-document summaries N(r̂′_d). Given specific values of N(r̂′_d) and π′^G, the local parameter vector θ̂′_d has a closed-form optimal value via the local-step formula: θ̂′_dk = N′_dk + α π′^G_k. In the example above, we assumed α = 1. Both summary vectors P′ and U′ are easily computed given θ̂′. We may run this proposal process at every document d inside a batch b.
This yields the summaries N′_b, S′_b, P′_b, U′_b, which can then naturally be aggregated and used to construct the global parameters via Step 2 of Fig. 4.2.

Multi-batch summary tracking for HDP topic models. With more than one batch (B > 1), we need to effectively track whole-dataset summaries P^G, U^G that represent batches at different truncation levels. We cannot simply insert zeros into the vector P^G, because these vectors represent log probabilities, not counts. In fact, there is no finite constant that we can insert in the log probability domain to represent unassigned mass. Rather than perform an operation to expand vectors to one consistent truncation level, we instead embrace the idea that different truncation levels can exist at different data units or batches simultaneously. We assign all data units in batch b to the same truncation K_b. We find we can represent this via coherent whole-dataset statistics P^G and U^G, each a vector of size K = max_b K_b. The vector P^G represents active log probabilities, and U^G tracks the inactive log probability mass. They are formally defined as:

P^G_k = Σ_{b: K_b ≥ k} P_bk(θ̂) = Σ_{d: K_d ≥ k} ψ(θ̂_dk) − ψ(θ̂_d:)   (5.56)
U^G_k = Σ_{b: K_b = k} U_bk(θ̂) = Σ_{d: K_d = k} ψ(θ̂_{d,>k}) − ψ(θ̂_d:)   (5.57)

The aggregated vector U^G will have non-zero entries at each unique truncation level existing in some current batch. After a proposal at batch b that transforms P_bk into P′_bk, we can create the new whole-dataset vector P′ using the following updates for each possible cluster index k ∈ {1, . . . , K + J′}:

P′_:k = P_:k − P_bk + P′_bk   if k = k_target        U′_:k = U_:k − U_bk + U′_bk   if k = k_target
P′_:k = P_:k                 else if k ≤ K          U′_:k = U_:k                 else if k ≤ K     (5.58)
P′_:k = P′_bk                else if k > K          U′_:k = U′_bk                else if k > K

Together, the proposal vectors P′^G, U′^G accurately represent the truncation status of every batch in the entire dataset. As usual, these vectors are sufficient statistics for a global update of allocation model parameters.
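A minimal sketch of the whole-dataset update in Eq. (5.58), assuming a birth proposal at one batch that grows its summary from K to K + J′ entries while leaving other batches untouched; the function and argument names are hypothetical.

```python
import numpy as np

def update_global_log_prob(P_global, P_b_old, P_b_new, k_target):
    """Sketch of Eq. (5.58): combine the whole-dataset log-probability
    summary with one batch's pre- and post-proposal summaries.

    P_global : (K,) whole-dataset summary before the proposal.
    P_b_old  : (K,) this batch's summary before the proposal.
    P_b_new  : (K + J,) this batch's summary after adding J clusters.
    """
    K = P_global.shape[0]
    out = P_b_new.copy()          # entries k > K come from this batch alone
    out[:K] = P_global            # entries k <= K keep their old value ...
    # ... except the target entry, where we swap old batch mass for new
    out[k_target] = P_global[k_target] - P_b_old[k_target] + P_b_new[k_target]
    return out
```

The same three-case rule applies verbatim to the inactive-mass summary U.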
5.6 Experimental results

Our experiments compare inference methods for fitting HDP topic models. For our new HDP objective, we study stochastic with fixed K (SOfix), memoized with fixed K (MOfix), and memoized with deletes and merges (MOdm). For baselines, we consider the collapsed sampler (Gibbs) of Teh et al. (2006), the stochastic CRF method (crfSOfix) of Wang et al. (2011), and the stochastic split-merge method (SOsm) of Bryant and Sudderth (2012). For each method, we perform several runs from various initial K values. For each run, we measure its predictive power via a heldout document completion task, as in Bryant and Sudderth (2012). Each model is summarized by a point-estimate of the topic-word probabilities φ. For each heldout document d, we randomly split its word tokens into two halves: x′_d, x″_d. We use the first half to infer a point-estimate of π_d, then estimate the log-likelihood of each token in the second half x″_d:

heldout-lik(x|φ) = [ Σ_{d∈D_test} log p(x″_d | π_d, φ) ] / [ Σ_{d∈D_test} |x″_d| ]   (5.59)

Hyperparameters. In all runs, we set γ = 10, α = 0.5, and topic-word pseudocount τ̄ = 0.1. Stochastic runs use the learning rate decay recommended in Bryant and Sudderth (2012): κ = 0.5, δ = 1.

[Figure: word-count images for example documents and final estimated topics from MOdm (topics 1-10 of 10), Gibbs (topics 1-15 and 25-30 of 67), and MOfix (topics 1-15 and 25-30 of 100), plus trace plots of the number of topics K and heldout log-likelihood versus number of passes through the data.]
Figure 5.6: Comparison of HDP topic model inference methods on toy bars dataset. Comparison of inference methods on the toy bars dataset from Sec. 5.6.1. Top row: Word count images for 7 example documents and the final 10 estimated topics from MOdm. Each image shows all 900 vocabulary types arranged in a square grid. Middle row: Final estimated topics from the Gibbs sampler and fixed-truncation memoized (MOfix).
We rank each algorithm’s final set of topics from most to least probable in terms of global appearance probability π^G_k, and show the topics ranked 1-15 and 25-30. Bottom row: Trace plots of the number of topics K and heldout likelihood during training. Line style indicates number of initial topics: dashed is K = 50, solid is K = 100.

5.6.1 Toy bars dataset

We study a variant of the toy bars dataset of Griffiths and Steyvers (2004), shown in Fig. 5.6. There are 10 ideal bar topics, 5 horizontal and 5 vertical. The bars are noisier than the original and cover a larger vocabulary (900 words). We generate 1000 documents for training and 100 more for heldout testing. Each document has 200 tokens drawn from 1-3 topics. Fig. 5.6 shows many runs of all algorithms on this benchmark. Variational methods initialized with 50 or 100 topics get stuck rapidly, while the Gibbs sampler finds a redundant set of the ideal topics and is unable to effectively merge down to the ideal 10. In contrast, our MOdm method uses merges and deletes to rapidly recover the 10 ideal bars after only a few laps. Without these moves, MOfix runs remain stuck at suboptimal fragments of bars. Furthermore, our MOdm method initialized with the sampler’s final topics (fromGibbs) easily recovers the ideal bars.

5.6.2 Academic and news articles

Next, we apply all methods to papers from the NIPS conference, articles from Wikipedia, and articles from the journal Science (Paisley et al., 2011), with 80%-20% train-test splits. Online methods process each training set in 20 batches. Trace plots in Fig. 5.7 compare predictive power and model complexity as more data is processed. We summarize conclusions below.

Anchor topics are good; variational is better. Using the anchor word method (Arora et al., 2013) for initial topic-word parameters yields better predictions than random initialization (rand). However, our methods can still make big, useful changes from this starting point. See Fig. 5.5 for some examples.
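The document-completion score of Eq. (5.59), used for all comparisons in this section, can be sketched as follows; the function name and the representation of documents as arrays of vocabulary indices are assumptions for illustration.

```python
import numpy as np

def heldout_per_token_loglik(second_halves, doc_topic_weights, phi):
    """Sketch of Eq. (5.59): total log-likelihood of each heldout
    document's second half x''_d, normalized by the total token count.

    second_halves     : list of 1-D arrays of vocabulary indices (x''_d).
    doc_topic_weights : list of (K,) topic weight vectors pi_d fit on x'_d.
    phi               : (K, V) point-estimate of topic-word probabilities.
    """
    total_ll, total_tokens = 0.0, 0
    for x, pi in zip(second_halves, doc_topic_weights):
        token_probs = pi @ phi[:, x]      # p(w | pi_d, phi) for each token
        total_ll += np.log(token_probs).sum()
        total_tokens += x.size
    return total_ll / total_tokens
```

Higher (less negative) values indicate better predictive power; normalizing by token count makes scores comparable across corpora with different document lengths.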
Deletes and merges make big, useful changes. Across all 3 datasets in Fig. 5.7, merges and deletes remove many topics. On Wikipedia, we reduce 200 topics to under 100 while improving predictions. Similar gains occur from the final result of the Gibbs sampler. Competitors get stuck or improve slowly. The Gibbs sampler needs many laps to make quality predictions. The CRF method gets stuck quickly, while our methods (using the direct assignment representation) do better from similar initializations. The stochastic split-merge method (SOsm) grows to a prescribed maximum number of topics but fails to make better predictions. This indicates problems with heuristic acceptance rules, and motivates our moves governed by exact evaluation of a whole-dataset objective. Next, we analyze the New York Times Annotated Corpus: 1.8 million articles from 1987 to 2007. We withhold 800 documents and divide the remainder into 200 batches (9084 documents per batch). Fig. 5.7 shows the predictive performance of the more-scalable methods. For this large-scale task, our direct assignment representation is more efficient than the CRF code released by Wang et al. (2011). With K = 200 topics, our memoized algorithm with merge and delete moves (MOdm) completes 8 laps through the 1.8 million documents in the amount of time the CRF code completes a single lap. 
No deletes or merges are accepted from any MOdm run, likely because 1.8M documents require more than a few hundred topics. However, the acceptance rate of sparsity-promoting restarts is 75%. With a more efficient, parallelized implementation, we believe our variational approach will enable reliable large-scale learning of topic models with larger K.

[Figure: trace plots of heldout log-likelihood and number of topics K versus passes through the data for NYTimes (D=1.8M), NIPS (D=1392), Wikipedia (D=7961), and Science (D=13077); methods shown are Gibbs, SOsm rand, crfSOfix rand, SOfix rand, MOfix rand, MOdm rand, MOdm spec, and MOdm fromGibbs, with K_init = 100 and 200.]
Figure 5.7: HDP topic model results on NIPS, Wikipedia, Science, and NYTimes datasets. Trace plots of predictive performance and number of topics K from Sec. 5.6.2.

[Figure: six panels (a)-(f).]
Figure 5.8: Comparison of DP mixtures and HDP admixtures on 3.5M image patches. (a-b) Trace plots of number of topics and heldout likelihood. (c) Patches from the top 4 estimated DP clusters. Each column shows 6 stacked 8 × 8 patches sampled from one cluster. (d-f) Patches from 4 top-ranked HDP clusters for select test images from BSDS500 (Arbelaez et al., 2011). See Sec. 5.6.3 for details.

5.6.3 Image patch modeling

Finally, we study 8 × 8 patches from grayscale natural images as in Zoran and Weiss (2012).
We train on 3.5 million patches from 400 images, comparing HDP admixtures to Dirichlet process (DP) mixtures using a zero-mean Gaussian likelihood. The HDP model captures within-image patch similarity via image-specific mixture component frequencies. Both methods are evaluated on 50 heldout images scored via Eq. (5.59). Fig. 5.8 shows merges and deletes removing junk topics while improving predictions, justifying the generality of these moves. Further, the HDP earns better prediction scores than the DP mixture. We illustrate this success by plotting sample patches from the top 4 topics (ranked by topic weight π) for several heldout images. The HDP adapts topic weights to each image, favoring smooth patches for some images (d) and textured patches for others (e-f). The less-flexible DP must use the same weights for all images (c).

5.7 Discussion

Variational inference for the HDP topic model faces numerous challenges not present in the DP mixture model: the non-conjugacy in the model requires our new surrogate optimization objective, while the presence of multiple random variables r̂_d, π_d in every document d leads to harsh local optima problems even when the global parameters are known. We have developed a full-dataset inference algorithm which can be scaled up with both memoized and stochastic techniques. Our birth, merge, and delete proposals offer ways to escape some bad local optima and deliver more reliable clustering results. Several issues remain open for the topic model, including the best way to initialize multinomial topic models and removing some of the slow information propagation between document-level frequency posteriors q(π_d) and the overall top-level frequency posterior q(π^G(u)). We hope our work is a first step toward reliable large-scale training of topic models.
Chapter 6

Scalable variational inference for HDP hidden Markov models

The hidden Markov model (HMM) (Rabiner, 1989) has long been a fundamental building block of modern unsupervised learning for sequential data, such as the motion capture segmentation task of Fig. 1.4 or the chromatin segmentation task of Fig. 1.5. Here, we consider a Bayesian nonparametric version of the hidden Markov model, with a hierarchical prior which shares statistical strength across the transition parameters π_k specific to each cluster k. Beal et al. (2001) introduced an early version of this model called the “infinite HMM”, which was later formalized with the hierarchical Dirichlet process as the HDP-HMM by Teh et al. (2006). Later, Fox et al. (2011) developed the “sticky” parameterization of the HDP-HMM, which encourages models to have larger self-transition probabilities to better match empirical data. Recently, Johnson and Willsky (2014) developed a stochastic variational inference algorithm for HDP-HMMs, as well as more complex models. However, while scalable, this approach did not investigate proposal moves that adapt the number of clusters, and it used a point estimate for some top-level allocation model parameters. In this chapter, we expand on earlier work published with collaborators William Stephenson and Erik Sudderth in the NIPS 2015 conference (Hughes et al., 2015b). In particular, the contributions of this work are:

Contribution 1: New optimization problem for the HDP-HMM with model selection capability. The sticky parameterization of the HDP-HMM requires an additional surrogate bound beyond that used for the HDP topic model in Ch. 5. Previous work (Johnson and Willsky, 2014) used a point estimation strategy for π^G (equivalently, for u) which is problematic for model selection. In contrast, by placing proper approximate posterior distributions on all random variables, our surrogate bound yields effective model selection, as we demonstrate in experiments.
Contribution 2: Scalable memoized algorithm for the HDP-HMM. Previously, only stochastic algorithms existed for this model. Our memoized approach offers a useful alternative which avoids the sensitivity to learning rate schedules required by the stochastic approach.

Contribution 3: Birth, merge, and delete proposals to escape local optima. We show that adaptive proposals which add or remove clusters offer improvements in both model quality (as measured by our objective function) and in application-specific metrics like Hamming distance. Our extensive experiments show promising results across speaker diarization, motion capture, and epigenomic segmentation tasks.

Roadmap. We first specify the model, focusing on the interesting model features relative to the HDP topic model: the sequential structure of p(z|π) and the sticky parameterization of the hierarchy of transition probabilities. Next, we set up the variational optimization problem and derive complete expressions for the objective function. Later, we give complete descriptions of coordinate ascent updates as well as both full-dataset and memoized algorithms. We follow with experimental results applying these training algorithms on several datasets, and conclude with a discussion of open problems of interest to future work.

6.1 Hierarchical Dirichlet Process Hidden Markov Models

We wish to jointly model data from D separate sequences, where sequence d has data x_d = [x_d1, x_d2, . . . , x_dTd] and observation x_dt is a vector representing some measurement at timestep t. We assume that these timesteps occur at regular intervals. For example, vector x_dt could be the spectrogram of an instant of audio, or the sensed joint positions of a human subject during a 100ms interval of the motion capture application illustrated in Fig. 1.4. The HDP-HMM explains this data by assigning each observation x_dt to a single hidden state z_dt.
The chosen state comes from a countably infinite set of possible cluster labels k ∈ {1, 2, . . .}. However, the states at neighboring timesteps are not drawn independently. Instead, we assume that the whole sequence z_d = [z_d1 z_d2 . . . z_dt . . . z_dTd] is generated via first-order Markovian dynamics. These are parameterized by initial state probabilities π_0 and transition probabilities {π_k}_{k=1}^∞. These global parameters themselves have a common prior distribution, which means they are related hierarchically. The entire graphical model of directed dependence relationships between hidden variables is shown in Fig. 6.1.

6.1.1 Generative model for each sequence.

To generate each sequence, we require two sets of global variables. First, the global cluster shape parameters {φ_k}_{k=1}^∞ of the observation model, which we assume come from their common prior in Sec. 2.1.3. Second, the Markovian transition parameters, whose generative story we provide in Sec. 6.1.2.

[Figure: directed graphical model with random-variable nodes u_k, π^G, π_0, π_k, z_d1 . . . z_dT, x_d1 . . . x_dT, φ_k and free parameters û_k, ω̂_k, θ̂_0, θ̂_k, ŝ_d1 . . . ŝ_d,T−1, ν̂_k, τ̂_k.]
Figure 6.1: Directed graphical representation of the HDP hidden Markov model (HDP-HMM). The diagram shows the fundamental random variables (circled nodes), hyperparameters (gray), and variational free parameters (red) of the hierarchical Dirichlet process (HDP) hidden Markov model (HDP-HMM). This Bayesian nonparametric model generates a countably infinite number of clusters under the prior, of which some number K ≤ N are assigned to data. This model has the same global parameters for cluster conditional probabilities u and shape parameters φ as the mixture model, as well as additional transition probabilities π_k and starting-state probabilities π_0. The conditional probabilities u are deterministically mapped to top-level global cluster probabilities π^G via the invertible stick-breaking transformation.
The dataset x consists of N total observations divided into D sequences, and within a sequence we assume Markovian structure exists among the assigned label sequence z_d1, z_d2, . . . , z_dT. Our variational representation captures this with pair-wise joint probabilities ŝ_dt for each timestep.

Given these global parameters, we allocate the state sequence z_d by first drawing its first timestep’s assignment from the initial distribution π_0, and then recursively using the transition distribution specified by each chosen cluster at timestep t − 1 to determine the cluster at timestep t:

z_d1 ∼ Cat_∞(π_0),   z_d2 | z_d1 ∼ Cat_∞(π_{z_d1}),   z_dt | z_{d,t−1} ∼ Cat_∞(π_{z_{d,t−1}}).   (6.1)

Given the assigned cluster label z_dt at timestep t, we then draw the observed data x_dt independently of any other variable in the sequence using a cluster-specific likelihood:

x_dt | z_dt = k ∼ L(x_dt | φ_k)   (6.2)

where L is our general exponential family likelihood distribution from Sec. 2.1.3.

First-order autoregressive models. As a small extension, we can allow the observation at timestep t to depend on the observation from timestep t − 1. This first-order autoregressive behavior is formally defined as:

x_dt | z_dt = k ∼ N(x_dt | A_k x_{d,t−1}, Σ_k)   (6.3)

Such autoregressive likelihoods belong to the exponential family, admit conjugate priors, and are amenable to all our inference techniques. See Fox (2009) for a detailed discussion of these autoregressive models, as well as MCMC training algorithms for them. Bayesian posterior analysis of such autoregressive processes goes back at least to Quintana and West (1987). While the graphical model in Fig. 6.1 does not depict autoregressive behavior explicitly, our inference procedure easily supports this extension.

6.1.2 Hierarchical prior on transition probabilities via the HDP.

Under the HDP-HMM prior and posterior, the number of clusters or states is unbounded. Each cluster k has its own infinite transition probability vector π_k = [π_k1 π_k2 . .
.], which must be non-negative and sum to one. Without careful regularization, the sheer number of parameters in the infinite transition matrix π makes it very possible to overfit to training datasets. We impose such regularization by having each transition vector π_k share a common prior. As in the HDP topic model, let π^G define a global probability vector over the infinite state space. We can write this quantity as a finite vector representing the K active topics and the aggregate mass of all inactive topics with index larger than K: π^G = [π^G_1 π^G_2 . . . π^G_K π^G_{>K}]. Then, given this common set of probabilities, we generate the transition vector π_k for cluster k as:

π_k ∼ Dir(απ^G_1, απ^G_2, . . . , απ^G_K, απ^G_{>K})   (6.4)

This regularizes the vector π_k by setting its mean to π^G but allowing some flexibility governed by the concentration α > 0. Any setting of α < K will typically encourage the vector π_k to be sparse, with only a few entries with mass significantly greater than zero. Typically, we set α ≈ 0.5. A similar prior is given to the starting state distribution π_0:

π_0 ∼ Dir(α_0 π^G_1, α_0 π^G_2, . . . , α_0 π^G_K, α_0 π^G_{>K})   (6.5)

We generally set α_0 >> α because very few starting states are observed and thus a smaller variance (greater concentration) is needed. Typically, we have α_0 ≈ 10. This hierarchical relationship between the top-level probabilities π^G and the cluster-level probabilities π_k is formally a realization of the hierarchical Dirichlet process (Teh et al., 2006). The HDP prior specification is completed by defining the stick-breaking prior on π^G. Given independent stick-breaking weights u_k ∼ Beta(1, γ) for each cluster, we use the stick-breaking transform of Eq. (3.4) to completely determine the K active entries of π^G:

π^G_1(u) = u_1,   π^G_k(u) = u_k ∏_{ℓ=1}^{k−1} (1 − u_ℓ).   (6.6)

6.1.3 Sticky self-transition bias.

In many applications of clustering for time-series or sequential data, we expect assigned clusters to persist for many timesteps.
For example, a human may walk for several minutes before switching to another activity, or the genome may exhibit long inactive segments in between transcription sites. The “sticky” parameterization of the HDP-HMM introduced by Fox et al. (2011) places extra mass on each cluster’s self-transition probability π_kk compared to the conventional prior:

[π_k1 . . . π_{k,>K}] ∼ Dir(απ^G_1, . . . , απ^G_{k−1}, απ^G_k + κ, . . . , απ^G_{>K})   (6.7)

Here, the scalar hyperparameter κ > 0 controls the degree of self-transition bias. Choosing κ >> 1 allows the model to place much higher probability on long periods of high self-transitions in observed sequences. Alternatives that use explicit duration distributions or related semi-Markovian models have been thoroughly discussed in Johnson and Willsky (2014), but we find the sticky parameterization to be effective while also affordable.

Global and local variables

By inspection of Fig. 6.1, the only local (data-attached) hidden variables in this model are the sequences of assignments z_d for every sequence d. We recognize several global parameters for the HDP-HMM. First, the cluster shape parameters {φ_k}_{k=1}^K remain the obvious global parameters of the observation model, just like in the HDP topic model. Second, the allocation model includes several global parameters: the top-level conditional probabilities {u_k}_{k=1}^K as well as the Markovian dynamics probabilities: starting vector π_0 and transition vectors {π_k}_{k=1}^K.

6.2 Posterior inference as a variational optimization problem

6.2.1 Mean-field approximate posterior

Our ideal goal in posterior inference is to estimate the joint distribution p(u, π, φ, {z_d}_{d=1}^D | {x_d}_{d=1}^D) given D observed sequences.
However, because this joint posterior is intractable, we appeal to mean-field variational methods to find the closest member of the factorized family:

q(u, π, φ, z) = ∏_{k=1}^∞ q(u_k) · q(π_0) · ∏_{k=1}^∞ q(π_k) · ∏_{k=1}^∞ q(φ_k) · ∏_{d=1}^D q(z_d)   (6.8)

q(φ) = ∏_{k=1}^∞ P(φ_k | τ̂_k, ν̂_k)   (6.9)
q(u) = ∏_{k=1}^∞ Beta(u_k | û_k ω̂_k, (1 − û_k) ω̂_k)   (6.10)
q(π_0) = Dir(π_0 | θ̂_01, θ̂_02, . . . , θ̂_0K, θ̂_{0,>K})   (6.11)
q(π_k) = Dir(π_k | θ̂_k1, θ̂_k2, . . . , θ̂_kK, θ̂_{k,>K})   (6.12)
q(z) = ∏_{d=1}^D Cat_∞(z_d1 | r̂_d1) ∏_{t=2}^{Td} Cat_∞(z_dt | ŝ_{z_{d,t−1}})   (6.13)

Our chosen factorization for q is similar to Johnson and Willsky (2014), but includes a proper approximate posterior for q(u) rather than a point-estimate. Just like our earlier methods for the DP mixture and HDP topic model, we do not require any truncation assumption beyond that for q(z) detailed below.

Approximate posterior factor for each state sequence

Our choice for the assignment sequence approximate posterior q(z) is not a full mean-field approach, but rather keeps the Markov structure of each sequence d:

q(z_d) ≜ [∏_{k=1}^∞ r̂_d1k^{δ_k(z_d1)}] ∏_{t=1}^{Td−1} ∏_{k=1}^∞ ∏_{ℓ=1}^∞ (ŝ_dtkℓ / r̂_dtk)^{δ_k(z_dt) δ_ℓ(z_{d,t+1})}   (6.14)

This factor is defined by two related free parameters. First, free parameter ŝ_dt defines the joint assignment probabilities for each pair of assignments z_dt, z_{d,t+1} at adjacent timesteps. Formally, ŝ_dtkℓ ≜ q(z_{d,t+1} = ℓ, z_dt = k). Thus, the parameter ŝ_dt has infinitely many rows and columns, but its entries are all non-negative and must sum to one. Next, the parameter vector r̂_dt defines the marginal probability of assignment at each timestep t. That is: r̂_dtk ≜ q(z_dt = k). To form a valid probability distribution for the whole sequence q(z_d), both the r̂_d and ŝ_d vectors must obey several constraints. First, at each timestep t ∈ {1, 2, . . . , Td − 1} the adjacent-pair joint probabilities must obey a sum-to-one constraint and a non-negativity constraint:

Σ_{k=1}^∞ Σ_{ℓ=1}^∞ ŝ_dtkℓ = 1,   ŝ_dtkℓ ≥ 0 for all (k, ℓ) ∈ {1, 2, . .
.} × {1, 2, . . .} (6.15) k=1 ℓ=1 Second, the marginal vector rˆdt at each timestep t = 1 . . . Td must obey similar constraints, as well a direct constraint that entry k of rˆdt equals the sum over row k in the joint probability matrix sˆdt : ∞ X rˆdtk = 1, rˆdtk ≥ 0 for all k ∈ 1, 2, . . . (6.16) k=1 ∞ X rˆdtk = sˆntkℓ for all k ∈ 1, 2, . . . ℓ=1 143 Truncation to K active clusters. Just like in HDP topic models and DP mixture models, we enforce the assumption that only the first K active clusters have any non-zero probability. All remaining inactive clusters with index k > K must have zero mass and thus need not be explicitly represented. Thus, at each timestep t we need to specify the K × K matrix sˆdt and the K-length vector rˆdt . Approximate posterior q(φk ) for global cluster shape Like every other model we’ve discussed, each cluster k in our countably infinite set is given an independent posterior factor q(φk ) for its shape parameter, which has form given by the conjugate prior density P and two global free parameters: pseudo-count νˆk and shape parameter τˆk . For more information, see Sec. 2.1.3. Approximate posterior q(uk ) for global cluster frequencies As in HDP topic models, each cluster k in our countably infinite set has an independent posterior factor q(uk ). We give q(uk ) a Beta distribution, parameterized so u ˆk gives the mean value of uk and ˆ k gives the variance, just as in the HDP topic model from Ch. 5, especially Sec. 5.2.1. ω Approximate posterior q(πk ) for transition probabilities. Each cluster k has a global parameter πk indicating how probable a transition from k to any other cluster is. We learn an approximate posterior for πk which is Dirichlet with K + 1 parameters θˆk1 , . . . θˆkK , θˆk>K . We interpret each θˆkℓ as a non-negative pseudo-count of how often we will tran- sition from k to state ℓ. The starting state probability vector π0 also has a similar K + 1-length vector named θˆ0 . 
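As a sanity check on these constraints, the marginal responsibilities r̂ can be recovered from the pairwise responsibilities ŝ by row sums, as Eq. (6.16) requires. A minimal numerical sketch (hypothetical helper names, not the released implementation; it assumes the K-truncated representation and does not enforce chain consistency between adjacent pairwise matrices):

```python
import numpy as np

def marginals_from_pairwise(s_d):
    """Given pairwise responsibilities s_d of shape (T-1, K, K),
    recover marginal responsibilities r_d of shape (T, K).

    Row sums give the marginal at timesteps 1..T-1; column sums of
    the last pairwise matrix give the marginal at the final timestep.
    """
    T_minus_1, K, _ = s_d.shape
    r_d = np.empty((T_minus_1 + 1, K))
    r_d[:-1] = s_d.sum(axis=2)       # r_dtk = sum_l s_dtkl, Eq. (6.16)
    r_d[-1] = s_d[-1].sum(axis=0)    # final marginal from column sums
    return r_d

# Toy example: T = 3 timesteps, K = 2 clusters
rng = np.random.default_rng(0)
s = rng.random((2, 2, 2))
s /= s.sum(axis=(1, 2), keepdims=True)   # each K x K matrix sums to one, Eq. (6.15)
r = marginals_from_pairwise(s)
assert np.allclose(r.sum(axis=1), 1.0)   # sum-to-one constraint of Eq. (6.16)
assert np.all(r >= 0)                    # non-negativity constraint
```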
Under this chosen approximate posterior, we have the expectations:

\mathbb{E}_q[\pi_{k\ell}] = \frac{\hat{\theta}_{k\ell}}{\sum_{j=1}^{K+1} \hat{\theta}_{kj}}, \qquad \mathbb{E}_q[\log \pi_{k\ell}] = \psi(\hat{\theta}_{k\ell}) - \psi\Big( \sum_{j=1}^{K+1} \hat{\theta}_{kj} \Big)   (6.17)

6.2.2 Evidence lower-bound objective function

Given the assumed factorization over the global random variables \phi, u, \{\pi_k\}_{k=0}^\infty and the local random variables z, we now set up an optimization problem to find the free parameters whose implied factorized posterior minimizes the KL divergence to the true posterior:

\arg\min_{\hat{\tau}, \hat{\nu}, \hat{u}, \hat{\omega}, \hat{\theta}, \hat{s}} \; \mathrm{KL}\Big( q\big(\phi, u, \{\pi_k\}_{k=0}^K, \{z_d\}_{d=1}^D \,\big|\, \hat{\tau}, \hat{\nu}, \hat{u}, \hat{\omega}, \hat{\theta}, \hat{s}\big) \;\Big\|\; p\big(\phi, u, \{\pi_k\}_{k=0}^K, \{z_d\}_{d=1}^D \,\big|\, x\big) \Big)   (6.18)

Following the standard logic used in deriving the tractable optimization problems in Ch. 3 and Ch. 5, we have an equivalent optimization problem:

\arg\max_{\hat{\tau}, \hat{\nu}, \hat{u}, \hat{\omega}, \hat{\theta}, \hat{s}} \; \mathcal{L}(x, \hat{\tau}, \hat{\nu}, \hat{u}, \hat{\omega}, \hat{\theta}, \hat{s})   (6.19)

\mathcal{L}(x, \hat{\tau}, \hat{\nu}, \hat{u}, \hat{\omega}, \hat{\theta}, \hat{s}) \triangleq \log p(x) - \mathrm{KL}\big( q(\phi, u, \pi, z \,|\, \hat{\tau}, \hat{\nu}, \hat{u}, \hat{\omega}, \hat{\theta}, \hat{s}) \,\big\|\, p(\phi, u, \pi, z \,|\, x) \big)   (6.20)

= \mathbb{E}_{q(\phi, u, \pi, z)}\big[ \log p(x, \phi, u, \pi, z) - \log q(\phi, u, \pi, z) \big]   (6.21)

Under our chosen factorized posterior q, the expectation that defines \mathcal{L} in the last line is computable via closed-form functions of the free parameters. This makes optimization possible. Remembering that the marginal responsibilities \hat{r} can be found as a deterministic function of the pairwise joint responsibilities \hat{s}, we can write this objective as a sum of two terms:

\mathcal{L}(x, \hat{\tau}, \hat{\nu}, \hat{u}, \hat{\omega}, \hat{\theta}, \hat{s}) = \mathcal{L}_{\mathrm{data}}(x, \hat{\tau}, \hat{\nu}, \hat{r}(\hat{s})) + \mathcal{L}_{\mathrm{alloc}}(\hat{u}, \hat{\omega}, \hat{\theta}, \hat{s})   (6.22)

The data term \mathcal{L}_{\mathrm{data}} is computed just as it was for HDP topic models and DP mixture models in Eq. (2.95). We need not reproduce it here. We will focus on evaluating the allocation model term, which itself decomposes into three interpretable pieces:

\mathcal{L}_{\mathrm{alloc}}(\hat{u}, \hat{\omega}, \hat{\theta}, \hat{s}) = \mathcal{L}_{\mathrm{entropy}}(\hat{s}) + \mathcal{L}_{\mathrm{HDP\text{-}trans}}(\hat{s}, \hat{\theta}, \hat{u}) + \mathcal{L}_{\mathrm{surrogate\text{-}HDP\text{-}top}}(\hat{u}, \hat{\omega})   (6.23)

We now briefly define each term via expectations.
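Before defining those terms, note that the expectations in Eq. (6.17) reduce to simple ratios and digamma-function evaluations of the θ̂ free parameters. A small self-contained sketch (with a finite-difference digamma so only the standard library is needed; a real implementation would call scipy.special.digamma):

```python
from math import lgamma, log

def digamma(x, h=1e-6):
    """Numerical digamma via central differences of log-Gamma.
    (In practice one would use scipy.special.digamma.)"""
    return (lgamma(x + h) - lgamma(x - h)) / (2.0 * h)

def transition_expectations(theta_row):
    """E_q[pi_kl] and E_q[log pi_kl] for one Dirichlet row of
    free parameters theta-hat, as in Eq. (6.17)."""
    total = sum(theta_row)
    e_pi = [t / total for t in theta_row]
    e_log_pi = [digamma(t) - digamma(total) for t in theta_row]
    return e_pi, e_log_pi

e_pi, e_log_pi = transition_expectations([2.0, 1.0, 1.0])
assert abs(sum(e_pi) - 1.0) < 1e-9
# Jensen's inequality: E[log pi] < log E[pi] entrywise.
assert all(el < log(ep) for el, ep in zip(e_log_pi, e_pi))
```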
Entropy term of the assignment posterior q(z). The entropy term is a simple function of the joint timestep probabilities \hat{s}:

\mathcal{L}_{\mathrm{entropy}}(\hat{s}) \triangleq -\sum_{d=1}^D \mathbb{E}_q[\log q(z_d)] = \sum_{d=1}^D \sum_{k=0}^K \sum_{\ell=1}^K H_{dk\ell}(\hat{s})   (6.24)

where for each source state index k \in \{0, 1, \ldots, K\} (with k = 0 denoting the starting state) and each receiving state index \ell \in \{1, 2, \ldots, K\} we have:

H_{dk\ell}(\hat{s}_d) = \begin{cases} -\hat{r}_{d1\ell} \log \hat{r}_{d1\ell} & \text{if } k = 0 \\ -\sum_{t=1}^{T_d - 1} \hat{s}_{dtk\ell} \log \frac{\hat{s}_{dtk\ell}}{\sum_{j=1}^K \hat{s}_{dtkj}} & \text{if } k = 1, 2, \ldots, K \end{cases}   (6.25)

Transition-specific term. The HDP-HMM has a term, similar to \mathcal{L}_{\mathrm{alloc}}, which aggregates all pieces of the objective that are functions of \hat{\theta}. This term is:

\mathcal{L}_{\mathrm{HDP\text{-}trans}}(\hat{s}, \hat{\theta}, \hat{u}) = \mathbb{E}_q\Big[ \log p(z \,|\, \pi) + \sum_{k=0}^K \log \frac{p(\pi_k)}{q(\pi_k)} \Big] - \sum_{k=0}^K \mathbb{E}\big[ c_{\mathrm{Dir}}(\alpha_k \pi^G(u)) \big]   (6.26)

= -\sum_{k=0}^K c_{\mathrm{Dir}}(\hat{\theta}_{k1}, \ldots, \hat{\theta}_{kK}, \hat{\theta}_{k>K}) + \sum_{k=0}^K \sum_{\ell=1}^{K+1} \big( M_{k\ell}(\hat{s}) + \alpha_k \pi^G_\ell(\hat{u}) + \kappa \delta_k(\ell) - \hat{\theta}_{k\ell} \big) \mathbb{E}_q[\log \pi_{k\ell}]

where c_{\mathrm{Dir}} is the cumulant function of a Dirichlet distribution in Eq. (5.22), the expectation of \log \pi_{k\ell} is given in Eq. (6.17), and we define the transition count summary statistic M_{k\ell} to give the expected total number of transitions from cluster k to cluster \ell:

M_{k\ell}(\hat{s}) = \begin{cases} \sum_{d=1}^D \hat{r}_{d1\ell} & \text{if } k = 0 \\ \sum_{d=1}^D \sum_{t=1}^{T_d - 1} \hat{s}_{dtk\ell} & \text{if } k \in \{1, 2, \ldots, K\} \end{cases}   (6.27)

Surrogate lower bound for top-level term. As in the HDP topic model, computing the expected value \mathbb{E}_{q(u)}[c_{\mathrm{Dir}}(\alpha_k \pi^G(u))] under the chosen Beta distribution form for q(u) has no known closed form. We thus use the surrogate bound from Eq. (5.24) to write this term as a tractable function of u.
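The entropy term of Eqs. (6.24)-(6.25) can be evaluated directly from the pairwise matrices. A minimal sketch for one sequence (hypothetical names; assumes the K-truncated representation, with a small epsilon guarding log(0)):

```python
import numpy as np

def entropy_term(r1, s_list, eps=1e-100):
    """Assignment entropy L_entropy for one sequence, Eqs. (6.24)-(6.25).

    r1     : length-K marginal at the first timestep (the k = 0 'start' row)
    s_list : list of (K, K) pairwise responsibility matrices, t = 1..T-1
    """
    # Start-state entropy: -sum_l r1[l] log r1[l]
    H = -np.sum(r1 * np.log(r1 + eps))
    # Transition entropy: -sum_t sum_kl s log(s / row_sum)
    for s_t in s_list:
        row = s_t.sum(axis=1, keepdims=True)   # r_dtk, the row sums
        H -= np.sum(s_t * (np.log(s_t + eps) - np.log(row + eps)))
    return H

# Deterministic assignments give exactly zero entropy.
K = 3
r1 = np.array([1.0, 0.0, 0.0])
s = np.zeros((K, K)); s[0, 1] = 1.0            # always transition 0 -> 1
assert abs(entropy_term(r1, [s])) < 1e-8
```

A fully uniform assignment, by contrast, attains the maximum entropy, which is what the objective rewards when the data are uninformative.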
Substituting this lower bound in and assuming a non-sticky HDP prior with \kappa = 0, we have:

\mathcal{L}_{\mathrm{surrogate\text{-}HDP\text{-}top}}(\hat{u}, \hat{\omega}) \leq \mathbb{E}_q\Big[ \log \frac{p(u)}{q(u)} \Big] + \sum_{k=0}^K \mathbb{E}_q\big[ c_{\mathrm{Dir}}(\alpha_k \pi^G(u)) \big]   (6.28)

\mathcal{L}_{\mathrm{surrogate\text{-}HDP\text{-}top}}(\hat{u}, \hat{\omega}) = K \log \alpha_0 + K^2 \log \alpha
+ \sum_{k=1}^K \Big[ c_{\mathrm{Beta}}(1, \gamma) - c_{\mathrm{Beta}}(\hat{u}_k \hat{\omega}_k, (1 - \hat{u}_k) \hat{\omega}_k) \Big]
+ \sum_{k=1}^K \big( (K+1) + 1 - \hat{u}_k \hat{\omega}_k \big) \mathbb{E}_q[\log u_k]
+ \sum_{k=1}^K \big( (K+1)(K+1-k) + \gamma - (1 - \hat{u}_k) \hat{\omega}_k \big) \mathbb{E}_q[\log(1 - u_k)]

This surrogate function can be evaluated in closed form given the free parameters \hat{u}, \hat{\omega} and the hyperparameters \alpha, \alpha_0, \gamma. For sticky values \kappa > 0, we need an alternative expression, which is derived in the next section.

6.2.3 Surrogate objective for sticky HDP-HMM

To handle the sticky parameterization of the HDP prior on the transition parameters \{\pi_k\}_{k=0}^K given the global probabilities \pi^G(u), a small change to the top-level surrogate term is required. Previously, we had:

c_{\mathrm{Dir}}(\alpha \pi^G(u)) \geq K \log \alpha + \sum_{\ell=1}^{K+1} \log \pi^G_\ell(u)   (6.29)

Now, we must compute the cumulant function of the vector \alpha \pi^G(u) plus an extra point mass \kappa on the k-th entry of the vector. In the supplement of Hughes et al. (2015b), we establish the following bound for any \kappa > 0, \alpha > 0:

c_{\mathrm{Dir}}(\alpha \pi^G(u) + \kappa \delta_k) \geq K \log \alpha - \log(\alpha + \kappa) + \log(\alpha \pi^G_k(u) + \kappa) + \sum_{\ell=1, \ell \neq k}^{K+1} \log \pi^G_\ell(u)   (6.30)

The term \log(\alpha \pi^G_k(u) + \kappa) still has no tractable expectation under q(u), so we apply a further bound given the concavity of the log function:

\log(\alpha \pi^G_k(u) + \kappa) \geq \pi^G_k(u) \log(\alpha + \kappa) + (1 - \pi^G_k(u)) \log \kappa   (6.31)
= \log \kappa + \big( \log(\alpha + \kappa) - \log \kappa \big) \pi^G_k(u)   (6.32)

Combining Eqs. (6.30) and (6.31), then unpacking the vector \pi^G(u) into a pure function of the u variables and computing the whole expectation of \sum_{k=0}^K c_{\mathrm{Dir}}(\alpha_k \pi^G(u) + \kappa \delta_k), we have:

\mathcal{L}_{\mathrm{surrogate\text{-}HDP\text{-}sticky\text{-}top}}(\hat{u}, \hat{\omega}) \leq \mathbb{E}_q\Big[ \log \frac{p(u)}{q(u)} \Big] + \mathbb{E}_q\big[ c_{\mathrm{Dir}}(\alpha_0 \pi^G(u)) \big] + \sum_{k=1}^K \mathbb{E}_q\big[ c_{\mathrm{Dir}}(\alpha \pi^G(u) + \kappa \delta_k) \big]

\mathcal{L}_{\mathrm{surrogate\text{-}HDP\text{-}sticky\text{-}top}}(\hat{u}, \hat{\omega}) = K \log \alpha_0 + K^2 \log \alpha + K \big( \log \kappa - \log(\alpha + \kappa) \big)   (6.33)
+ \sum_{k=1}^K \Big[ c_{\mathrm{Beta}}(1, \gamma) - c_{\mathrm{Beta}}(\hat{u}_k \hat{\omega}_k, (1 - \hat{u}_k) \hat{\omega}_k) \Big]
+ \big( \log(\alpha + \kappa) - \log \kappa \big) \sum_{k=1}^K \pi^G_k(\hat{u})
+ \sum_{k=1}^K \big( K + 1 - \hat{u}_k \hat{\omega}_k \big) \mathbb{E}_q[\log u_k]
+ \sum_{k=1}^K \big( K(K + 1 - k) + \gamma - (1 - \hat{u}_k) \hat{\omega}_k \big) \mathbb{E}_q[\log(1 - u_k)]

Directly cross-referencing the non-sticky surrogate term in Eq. (6.28) and the sticky surrogate term in Eq. (6.33) shows that little has changed, and the overall complexity of evaluating the surrogate top-level objective remains the same.

6.3 Update steps for variational optimization

Using the objective functions defined above, we can write down the constrained optimization problem for our free parameters \hat{\tau}, \hat{\nu}, \hat{u}, \hat{\omega}, \hat{\theta}, \hat{s} given an observed dataset x and hyperparameters H:

\arg\max_{\hat{\tau}, \hat{\nu}, \hat{u}, \hat{\omega}, \hat{\theta}, \hat{s}} \; \mathcal{L}_{\mathrm{data}}(x, \hat{\tau}, \hat{\nu}, \hat{r}(\hat{s})) + \mathcal{L}_{\mathrm{entropy}}(\hat{s}) + \mathcal{L}_{\mathrm{HDP\text{-}trans}}(\hat{s}, \hat{\theta}, \hat{u}) + \mathcal{L}_{\mathrm{surrogate\text{-}HDP\text{-}sticky\text{-}top}}(\hat{u}, \hat{\omega})   (6.34)

where the required constraints on the local free parameters for sequence d are:

\hat{s}_{dt} \geq 0 \;\text{ and }\; \sum_{k=1}^K \sum_{\ell=1}^K \hat{s}_{dtk\ell} = 1 \quad \text{for } t = 1, 2, \ldots, T_d   (6.35)

and the required constraints on the global free parameters are:

\hat{u}_k \in [0, 1] \;\text{ and }\; \hat{\omega}_k \geq 0 \quad \text{for } k = 1, 2, \ldots   (6.36)
\hat{\nu}_k \geq 0 \;\text{ and }\; \hat{\tau}_k \in \mathcal{M} \quad \text{for } k = 1, 2, \ldots
\hat{\theta}_{jk} \geq 0 \quad \text{for } (j, k) \in \{0, 1, 2, \ldots\} \times \{1, 2, \ldots\}

As with earlier models, we pursue a block-coordinate ascent algorithm, which proceeds in two steps: a local step and a global step. The local step updates the free parameters \hat{s}_d for each sequence d while holding global parameters fixed. The global step updates the global free parameters given fixed (summary statistics of) local parameters.
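The concavity bound in Eq. (6.31) follows from writing \alpha\pi + \kappa = \pi(\alpha + \kappa) + (1 - \pi)\kappa and applying Jensen's inequality to the concave log. It is easy to verify numerically (arbitrary illustrative hyperparameter values):

```python
import math

def lhs(pi_g, alpha, kappa):
    # Intractable term: log(alpha * pi^G_k(u) + kappa)
    return math.log(alpha * pi_g + kappa)

def surrogate(pi_g, alpha, kappa):
    # Eqs. (6.31)-(6.32): linear-in-pi lower bound from concavity of log
    return pi_g * math.log(alpha + kappa) + (1.0 - pi_g) * math.log(kappa)

alpha, kappa = 0.5, 50.0
for i in range(101):
    pi_g = i / 100.0
    assert lhs(pi_g, alpha, kappa) >= surrogate(pi_g, alpha, kappa) - 1e-12
# The bound is tight at the endpoints pi_g = 0 and pi_g = 1.
assert abs(lhs(0.0, alpha, kappa) - surrogate(0.0, alpha, kappa)) < 1e-12
assert abs(lhs(1.0, alpha, kappa) - surrogate(1.0, alpha, kappa)) < 1e-12
```

Tightness at the endpoints is what makes the surrogate usable for model selection: for sticky models the self-transition mass \pi^G_k is typically near its extremes.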
When applied iteratively, these steps are guaranteed to monotonically improve the whole objective \mathcal{L} until it converges to a local optimum. We emphasize again that due to the required surrogate bound term, which is \mathcal{L}_{\mathrm{surrogate\text{-}HDP\text{-}top}} when \kappa = 0 or \mathcal{L}_{\mathrm{surrogate\text{-}HDP\text{-}sticky\text{-}top}} when \kappa > 0, the objective function \mathcal{L} we are optimizing is a strict lower bound on \log p(x) - \mathrm{KL}(q \| p). Below, we define the optimization problem solved by each step of the block-coordinate ascent algorithm; the following sections then describe closed-form solutions to each problem.

6.3.1 Global parameter update step for observation model

Consider the optimization problem:

\arg\max_{\hat{\tau}, \hat{\nu}} \; \mathcal{L}_{\mathrm{data}}(x, \hat{\tau}, \hat{\nu}, \hat{r}) \quad \text{subject to } \hat{\nu}_k \geq 0 \text{ and } \hat{\tau}_k \in \mathcal{M} \text{ for } k = 1, 2, \ldots   (6.37)

where we remember that for the HDP-HMM, the responsibilities \hat{r} at each timestep are a deterministic function of the pairwise responsibilities \hat{s}. The optimization problem in Eq. (6.37) is the same as for the observation model in the DP mixture model, found in Eq. (3.39). Thus, we can apply the solution from Eq. (3.41) directly:

\hat{\tau}^*_k = S_k(x, \hat{r}) + \bar{\tau}, \qquad \hat{\nu}^*_k = N_k(\hat{r}) + \bar{\nu}   (6.38)

6.3.2 Global step for allocation model

The allocation model global step involves two substeps: (1) finding the optimal top-level parameters \hat{u}, \hat{\omega}, and (2) finding the optimal transition parameters \hat{\theta}. Below, we describe our solutions to each of these subproblems, and an approach to the complete problem.

Global step for transition parameters. We now consider finding the optimal \hat{\theta} parameters. Consider the optimization problem:

\arg\max_{\hat{\theta}} \; \mathcal{L}_{\mathrm{HDP\text{-}trans}}(\hat{s}, \hat{\theta}, \hat{u}) \quad \text{subject to } \hat{\theta}_{jk} \geq 0 \text{ for } j = 0, 1, \ldots, K, \; k = 1, 2, \ldots, K + 1   (6.39)

Using standard exponential family mathematics, we find a closed-form solution for any value of \kappa \geq 0:

\hat{\theta}^*_{jk} = M_{jk}(\hat{s}) + \alpha_j \pi^G_k(\hat{u}) + \kappa \delta_j(k)   (6.40)

which naturally shares much in common with the update for document-level \hat{\theta} parameters from the HDP topic model, but differs in the use of the pairwise count statistics M_{jk}(\hat{s}) and the sticky hyperparameter \kappa.

Global step for top-level parameters. Consider the optimization problem:

\arg\max_{\hat{u}, \hat{\omega}} \; \mathcal{L}_{\mathrm{HDP\text{-}trans}}(\hat{s}, \hat{\theta}, \hat{u}) + \mathcal{L}_{\mathrm{surrogate\text{-}HDP\text{-}top}}(\hat{u}, \hat{\omega})   (6.41)

subject to \hat{u}_k \in [0, 1] for k = 1, 2, \ldots   (6.42)
\hat{\omega}_k \geq 0 for k = 1, 2, \ldots   (6.43)

Due to non-conjugacy, we cannot update \hat{u}, \hat{\omega} in closed form. However, we can apply the same strategies as in the HDP topic model. First, we have a heuristic update for \hat{\omega}, which follows the same derivation pattern as the similar update in Eq. (5.39):

\hat{\omega}_k = \begin{cases} (K+1)(K+2-k) + 1 + \gamma & \text{if } k \leq K \text{ and } \kappa = 0 \\ K(K+2-k) + 1 + \gamma & \text{if } k \leq K \text{ and } \kappa > 0 \\ 1 + \gamma & \text{if } k > K \end{cases}   (6.44)

Second, we have a numerical optimization update for \hat{u}. The objective changes slightly depending on whether we have a non-sticky model with \kappa = 0 or a sticky state-transition model with \kappa > 0. For both cases, we provide the exact objective needed below. These functions can be provided to
any numerical optimization technique to find an optimal \hat{u}^* within the constraint set.

Function for \hat{u} with non-sticky model (\kappa = 0):

f(\hat{u}, \hat{\omega}, P(\hat{\theta}), \alpha, \alpha_0, \gamma) = -\sum_{k=1}^K c_{\mathrm{Beta}}(\hat{u}_k \hat{\omega}_k, (1 - \hat{u}_k) \hat{\omega}_k)   (6.45)
+ \sum_{k=1}^K \big( (K+1) + 1 - \hat{u}_k \hat{\omega}_k \big) [\psi(\hat{u}_k \hat{\omega}_k) - \psi(\hat{\omega}_k)]
+ \sum_{k=1}^K \big( (K+1)(K+1-k) + \gamma - (1 - \hat{u}_k) \hat{\omega}_k \big) [\psi((1 - \hat{u}_k) \hat{\omega}_k) - \psi(\hat{\omega}_k)]
+ \sum_{j=0}^K \sum_{k=1}^{K+1} \alpha_j \pi^G_k(\hat{u}) P_{jk}(\hat{\theta})

where we define the (K+1)-dimensional log-probability statistic P_j for the starting state (j = 0) as well as each active state 1 \leq j \leq K:

P_{jk}(\hat{\theta}) \triangleq \mathbb{E}_q[\log \pi_{jk}] = \psi(\hat{\theta}_{jk}) - \psi\Big( \sum_{\ell=1}^{K+1} \hat{\theta}_{j\ell} \Big)   (6.46)

Function for \hat{u} with sticky model (\kappa > 0):

f(\hat{u}, \hat{\omega}, P(\hat{\theta}), \alpha, \alpha_0, \kappa, \gamma) = -\sum_{k=1}^K c_{\mathrm{Beta}}(\hat{u}_k \hat{\omega}_k, (1 - \hat{u}_k) \hat{\omega}_k)   (6.47)
+ \sum_{k=1}^K \big( K + 1 - \hat{u}_k \hat{\omega}_k \big) [\psi(\hat{u}_k \hat{\omega}_k) - \psi(\hat{\omega}_k)]
+ \sum_{k=1}^K \big( K(K+1-k) + \gamma - (1 - \hat{u}_k) \hat{\omega}_k \big) [\psi((1 - \hat{u}_k) \hat{\omega}_k) - \psi(\hat{\omega}_k)]
+ \sum_{j=0}^K \sum_{k=1}^{K+1} \alpha_j \pi^G_k(\hat{u}) P_{jk}(\hat{\theta})
+ \big( \log(\alpha + \kappa) - \log \kappa \big) \sum_{k=1}^K \pi^G_k(\hat{u})

Complete global step for allocation model. Updating \hat{u} requires the statistics P(\hat{\theta}), while updating \hat{\theta} requires the quantity \pi^G(\hat{u}). Because of the interdependence of these updates, we recommend that, if possible, they be iterated several times during the complete global step to improve convergence. That is, given updated values of the assignments \hat{s}, we first update \hat{\theta} via Eq. (6.40), and then alternate between the \hat{u} and \hat{\theta} updates until convergence.

6.3.3 Local step to update assigned state sequence

The update of the assignment responsibility parameters \hat{s}_d can be processed independently for each sequence d. However, within a sequence d we need to find a jointly optimal value across all timesteps \{\hat{s}_{dt}\}_{t=1}^{T_d} via dynamic programming (Beal, 2003). The forward-backward algorithm (Rabiner, 1989) takes two input parameters. First, for the d-th sequence we compute a matrix C_d, which has T_d rows and K columns; each entry gives the expected log likelihood of the data x_{dt} at timestep t under cluster k: C_{dtk} = \mathbb{E}_q[\log p(x_{dt} \,|\, \phi_k)]. Second, we require the log transition probability vectors \{P_j(\hat{\theta})\}_{j=0}^K for each possible active cluster as well as the starting state (j = 0). Given these values, the algorithm produces the optimal probabilities \hat{s}_d for the sequence under the objective in Eq. (6.48):

\arg\max_{\hat{s}_d} \; \mathcal{L}_{\mathrm{data}}(x_d, \hat{\tau}, \hat{\nu}, \hat{r}_d(\hat{s}_d)) + \mathcal{L}_{\mathrm{entropy}}(\hat{s}_d) + \mathcal{L}_{\mathrm{HDP\text{-}trans}}(\hat{s}_d, \hat{\theta}, \hat{u})   (6.48)
\text{subject to } \sum_{k, \ell} \hat{s}_{dtk\ell} = 1 \;\text{ and }\; \hat{s}_{dt} \geq 0 \;\text{ for } t = 1, 2, \ldots, T_d

From \hat{s}_d, we can easily compute the marginal responsibilities \hat{r}_d via summation.

Runtime cost. The dynamic programming required here has cost O(T_d K^2) for the d-th sequence, where K is the number of active clusters and T_d is the number of timesteps in the sequence. For efficiency, multiple sequences can be processed in parallel.

6.4 Variational algorithms with a fixed number of states

6.4.1 Full-dataset algorithm

A complete full-dataset coordinate ascent algorithm is given in Alg. 6.1. It alternates between visiting all D sequences in the local step and then updating all the global free parameters in the global step. The runtime cost is dominated by the local step, where each dynamic programming pass has cost O(T_d K^2).

6.4.2 Memoized variational algorithm

The full-dataset algorithm can be easily extended to process one small batch of sequences D_b during each local step instead of all D sequences. Assuming that all D sequences have been divided into B total batches before iterations begin, we can apply the memoized algorithm in Alg. 5.5.

Memoized algorithm and monotonicity guarantee. The memoized algorithm for training the HDP-HMM in Alg. 6.2 is guaranteed to monotonically increase the objective function \mathcal{L} after both the global step and the local step. This is because, unlike HDP topic models, which have multiple local variables, the HDP-HMM has only the local pairwise responsibilities \hat{s}. The forward-backward algorithm delivers optimal values for the local parameter \hat{s} under a fixed set of global parameters.
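The local step above is the classic forward-backward recursion. A compact log-space sketch is given below (a hypothetical helper, not the released bnpy code); it maps the expected log likelihoods C and log transition probabilities to marginal and pairwise responsibilities satisfying the constraints of Eqs. (6.15)-(6.16):

```python
import numpy as np

def _lse(x, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def fwd_bwd_for_seq(C, logP0, logP):
    """Forward-backward for one sequence in log space.

    C     : (T, K) expected log likelihoods, C[t, k] = E_q[log p(x_t | phi_k)]
    logP0 : (K,)   log start probabilities
    logP  : (K, K) log transition matrix, entry [j, k] for j -> k
    Returns r (T, K) marginal and s (T-1, K, K) pairwise responsibilities.
    """
    T, K = C.shape
    log_a = np.empty((T, K))      # forward messages
    log_b = np.zeros((T, K))      # backward messages
    log_a[0] = logP0 + C[0]
    for t in range(1, T):
        log_a[t] = C[t] + _lse(log_a[t - 1][:, None] + logP, axis=0)
    for t in range(T - 2, -1, -1):
        log_b[t] = _lse(logP + (C[t + 1] + log_b[t + 1])[None, :], axis=1)
    # Marginal responsibilities r[t, k] = q(z_t = k)
    log_r = log_a + log_b
    r = np.exp(log_r - _lse(log_r, axis=1)[:, None])
    # Pairwise responsibilities s[t, j, k] = q(z_t = j, z_{t+1} = k)
    s = np.empty((T - 1, K, K))
    for t in range(T - 1):
        log_s = log_a[t][:, None] + logP + (C[t + 1] + log_b[t + 1])[None, :]
        s[t] = np.exp(log_s - _lse(log_s.ravel(), axis=0))
    return r, s

# Sanity checks on a random 2-state problem
rng = np.random.default_rng(0)
C = rng.normal(size=(5, 2))
logP0 = np.log([0.5, 0.5])
P = np.array([[0.9, 0.1], [0.2, 0.8]])
r, s = fwd_bwd_for_seq(C, logP0, np.log(P))
assert np.allclose(r.sum(axis=1), 1.0)        # Eq. (6.16) sum-to-one
assert np.allclose(s.sum(axis=(1, 2)), 1.0)   # Eq. (6.15) sum-to-one
assert np.allclose(s.sum(axis=2), r[:-1])     # marginal consistency
```

The two nested loops over timesteps give the O(T_d K^2) cost quoted above.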
Thus, the local step is guaranteed to improve the objective \mathcal{L}.

Memoized evaluation of the objective. As in DP mixture models or HDP topic models, we can compute the objective score \mathcal{L} for the current configuration of global parameters and global statistics of local parameters after visiting every batch at least once. This corresponds to completing the first epoch or lap of memoized inference. For this computation, the only auxiliary statistic we need is the entropy statistic of the pairwise responsibilities \hat{s}.

Algorithm 6.1 Variational coordinate ascent for HDP-HMM
Input:
  \{x_d\}_{d=1}^D : dataset with D sequences, each of length T_d
  K : truncation level
  \{\hat{\tau}_k, \hat{\nu}_k\}_{k=1}^K : initial global parameters of observation model
  \{\hat{u}_k, \hat{\omega}_k\}_{k=1}^K : initial global parameters of allocation model defining q(u)
  \{\{\hat{\theta}_{jk}\}_{k=1}^{K+1}\}_{j=0}^K : initial global parameters defining Markov transition probabilities
  \gamma, \alpha, \alpha_0 : allocation model hyperparameters
  \bar{\tau}, \bar{\nu} : observation model prior hyperparameters
Output:
  \{\hat{\tau}_k, \hat{\nu}_k\}_{k=1}^K : updated global parameters of observation model
  \{\hat{u}_k, \hat{\omega}_k\}_{k=1}^K : updated global parameters of allocation model
  \{\{\hat{\theta}_{jk}\}_{k=1}^{K+1}\}_{j=0}^K : updated global parameters defining Markov transition probabilities

function VariationalCoordAscentForHDPHMM(x, K, \hat{\tau}, \hat{\nu}, \hat{u}, \hat{\omega}, \hat{\theta})
  while not converged do
    for d \in 1, 2, \ldots, D do                         ▷ Local step at sequence d
      for t \in 1, 2, \ldots, T_d do
        for k \in 1, 2, \ldots, K do
          C_{dtk} \leftarrow \mathbb{E}_q[\log p(x_{dt} \,|\, \phi_k)]
      \hat{s}_d, \hat{r}_d \leftarrow \text{FwdBwdAlgForSeq}(C_d(\hat{\tau}, \hat{\nu}), P(\hat{\theta}))
    for k \in 1, 2, \ldots, K do                         ▷ Summary step
      S_k \leftarrow \sum_{d=1}^D \sum_{t=1}^{T_d} \hat{r}_{dtk} \, s(x_{dt})
      N_k \leftarrow \sum_{d=1}^D \sum_{t=1}^{T_d} \hat{r}_{dtk}
      M_{0k} \leftarrow \sum_{d=1}^D \hat{r}_{d1k}
      for j \in 1, 2, \ldots, K do
        M_{jk} \leftarrow \sum_{d=1}^D \sum_{t=1}^{T_d - 1} \hat{s}_{dtjk}
    for k \in 1, 2, \ldots, K do                         ▷ Global step for observation parameters
      \hat{\tau}_k \leftarrow S_k + \bar{\tau}
      \hat{\nu}_k \leftarrow N_k + \bar{\nu}
    while not converged do                               ▷ Global step for allocation parameters
      for j \in 0, 1, \ldots, K do
        for k \in 1, 2, \ldots, K + 1 do
          \hat{\theta}_{jk} \leftarrow M_{jk} + \alpha_j \pi^G_k(\hat{u}) + \kappa \delta_j(k)
        for k \in 1, 2, \ldots, K + 1 do
          P_{jk} \leftarrow \psi(\hat{\theta}_{jk}) - \psi(\sum_{\ell=1}^{K+1} \hat{\theta}_{j\ell})
      \hat{u} \leftarrow \arg\max_{\hat{u}} f(\hat{u}, \hat{\omega}, P(\hat{\theta}), \alpha, \gamma, \kappa)   ▷ via L-BFGS
  return \hat{\tau}, \hat{\nu}, \hat{u}, \hat{\omega}, \{\hat{\theta}_j\}_{j=0}^K

Full-dataset algorithm for approximate posterior inference for the HDP hidden Markov model shown in Fig. 6.1.

Entropy statistic for ELBO computation. At batch b, let H_{bjk} define the sum of assignment entropies related to the transition between clusters j and k across all sequences in the batch:

\mathcal{L}_{\mathrm{entropy}}(\hat{s}) = \sum_{j=0}^K \sum_{k=1}^K H^G_{jk}(\hat{s}), \qquad H^G_{jk} = \sum_{b=1}^B H_{bjk}, \qquad H_{bjk}(\hat{s}) \triangleq \sum_{d \in \mathcal{D}_b} H_{djk}(\hat{s})   (6.49)

where the scalar entropy statistic H_{djk} for the assignment posterior q(z_d) of sequence d is defined in Eq. (6.25). We compute this scalar for each possible source cluster j \in \{0, 1, 2, \ldots, K\}, including the starting-state label j = 0, and for every possible destination cluster k \in \{1, 2, \ldots, K\}. Again, our assumed truncation directly prevents any transition to clusters beyond index K. These infinitely many inactive clusters all have zero probability mass and thus zero entropy.

6.5 Variational algorithms with proposal moves that adapt the number of clusters

6.5.1 Merge proposals

Merge proposals try to find a less redundant but equally expressive model. Each proposal takes a pair of existing states i < j and constructs a candidate model where data from state j is reassigned to state i. Conceptually this reassignment gives a new value \hat{s}', but instead the statistics M', S' can be directly computed and used in a global update for candidate parameters \hat{\tau}', \hat{\nu}', \hat{u}', \hat{\omega}', \hat{\theta}':

S'_i = S_i + S_j, \quad M'_{:i} = M_{:i} + M_{:j}, \quad M'_{i:} = M_{i:} + M_{j:}, \quad M'_{ii} = M_{ii} + M_{jj} + M_{ji} + M_{ij}.

While most terms in \mathcal{L} are linear functions of our cached sufficient statistics, the entropy \mathcal{L}_{\mathrm{entropy}} is not.
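The merged statistics can be formed with simple array slicing. A minimal sketch with hypothetical names (S is a length-K vector of data statistics and M the K × K block of expected transition counts; the start-state row is omitted for brevity):

```python
import numpy as np

def merge_stats(S, M, i, j):
    """Candidate sufficient statistics after merging state j into state i.

    Implements S'_i = S_i + S_j, M'_{:i} = M_{:i} + M_{:j},
    M'_{i:} = M_{i:} + M_{j:}; folding the column first and then the
    row makes M'_{ii} = M_ii + M_jj + M_ij + M_ji automatically.
    Row and column j are then dropped.
    """
    S2 = S.copy(); M2 = M.copy()
    S2[i] += S2[j]
    M2[:, i] += M2[:, j]            # fold incoming transitions
    M2[i, :] += M2[j, :]            # fold outgoing transitions
    keep = [k for k in range(len(S)) if k != j]
    return S2[keep], M2[np.ix_(keep, keep)]

S = np.array([10.0, 4.0, 6.0])
M = np.arange(9, dtype=float).reshape(3, 3)
S2, M2 = merge_stats(S, M, 0, 2)
assert S2.tolist() == [16.0, 4.0]
assert float(M2.sum()) == float(M.sum())   # transition mass is preserved
```

Because the candidate statistics are computed without touching the per-sequence responsibilities, many merge pairs can be scored in one pass.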
Thus, for each candidate merge pair (i, j), we use O(K) storage and computation to track column H'_{:i} and row H'_{i:} of the corresponding merged entropy matrix H'. Because all terms in the H' matrix of Eq. (6.24) are non-negative, we can lower-bound \mathcal{L}_{\mathrm{entropy}} by summing a subset of H'. This allows us to make coherent accept or reject decisions for multiple pairs, while guaranteeing that we never accept a merge unless the resulting configuration of global and local parameters would yield an improved objective function value. The only caveat is that, as with DP mixtures or HDP topic models, we can only accept merge pairs (i, j) whose cluster indices have not been involved in any previously accepted merges during that pass. Because many entries of H' are near zero, our bound is very tight, and in practice it enables us to scalably merge many redundant state pairs in each lap through the data.

Selecting candidate merge pairs. Before each pass through the dataset, we examine all pairs of states and keep those pairs (i, j) which improve a score function similar to that used for DP mixture models in Sec. 4.1.4. That is, we use all terms of the objective \mathcal{L} for which a candidate value for pair (i, j) can be constructed directly from global sufficient statistics and global parameters.

6.5.2 Birth proposals

Our birth moves for the HDP-HMM can create many new states at once while maintaining the monotonic increase of the whole-dataset objective \mathcal{L}. Each proposal happens within the local step by trying to improve q(z_d) for a single sequence d. Given current assignments \hat{s}_d with truncation K, the move proposes new assignments \hat{s}'_d that include the K existing states and some new states with index k > K. This procedure is detailed below. Next, we combine the proposed value of \hat{s}'_d with the existing summary statistics for other sequences to obtain valid candidate summary statistics M', N', S' for the whole dataset.
Given these global summaries, we can compute candidate global parameters \hat{u}', \hat{\omega}', \hat{\theta}', \hat{\tau}', \hat{\nu}' via the standard global coordinate ascent updates. Finally, we can compute \mathcal{L}' for the candidate and compare it to the original configuration's objective score \mathcal{L}. If \mathcal{L} improves under the proposal, we accept and use the expanded set of states for all remaining updates in the current lap.

Constructing proposed assignments \hat{s}'. The proposal for expanding the pairwise assignments \hat{s}'_d for sequence d with new states can flexibly take any form, from very naïve to very data-driven. For data with “sticky” state persistence, we recommend randomly choosing one interval [t, t+δ] of the current sequence to reassign when creating \hat{s}'_d, leaving other timesteps fixed. We split this interval into two contiguous blocks (one may be empty), each completely assigned to a new state. This search finds the cut point that maximizes the observation-model objective \mathcal{L}_{\mathrm{data}}. Other proposals such as sub-cluster splits (Chang and Fisher III, 2014) could be easily incorporated in our variational algorithm, but we find this simple interval-based proposal to be fast and effective.

6.5.3 Delete proposals

Our proposal to delete a rarely-used state j begins by dropping row j and column j from M to create M', and dropping S_j from S to create S'. Using a target dataset of sequences with non-trivial mass on state j, x' = \{x_n : \sum_{t=1}^{T_n} \hat{r}_{ntj} > 0.01\}, we run global and local parameter updates to reassign observations from the former state j in a data-driven way. Rather than verifying on only the target dataset as in Hughes et al. (2015a), we accept or reject the delete proposal via the whole-dataset bound \mathcal{L}. To control computation, we only propose deleting states used in 10 or fewer sequences.

Candidate selection. Deletes are more flexible than merges at reassigning mass, but require expensive local steps on the target data.
To keep costs affordable, before each lap we find a group of L states whose unified target set is at most 10 sequences, prioritizing states that have not been attempted before. We then collect the unified target set and try all L deletes one at a time. If no such group can be found (L = 0), no move is performed.

6.6 Experimental Results

In this section, we present results from several experiments comparing various training algorithms for the HDP-HMM. We consider two variants of our scalable memoized algorithm with adaptive proposals: one with all three proposal moves (birth, merge, and delete) and one with only the proposal moves that remove clusters (merge, delete). As a baseline, we consider two scalable algorithms that also optimize the same objective \mathcal{L}: a fixed-truncation memoized algorithm and a fixed-truncation stochastic algorithm. We further compare to the blocked Gibbs sampler for the HDP-HMM (Fox et al., 2011) that was previously shown to mix faster than slice samplers (Van Gael et al., 2008). These baselines (including the blocked sampler) maintain a fixed number of states K, though some states may have usage fall to zero as training proceeds. We start all fixed-K methods (including the sampler) from matched initializations, so comparisons are fair. See the original paper (Hughes et al., 2015b) and its supplement for further discussion and all details needed to reproduce these experiments. Our publicly released Python source code can be found online¹.

Learning rate schedule. For all datasets, we set the learning rate \rho_t for stochastic variational inference at update iteration t to \rho_t = (1 + t)^{-0.51}. This is a fairly aggressive schedule, recommended in past work by Bryant and Sudderth (2012). Future work could tune this specifically for each dataset, but we chose to simplify the comparisons here.
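Restoring standard exponent notation, this schedule is \rho_t = (1 + t)^{-0.51}. A one-line sketch with the conventional delay/forgetting-rate parameterization (parameter names are ours, not from the released code):

```python
def learning_rate(t, delay=1.0, forget=0.51):
    """SVI step size rho_t = (delay + t) ** (-forget).

    A forgetting rate in (0.5, 1] satisfies the Robbins-Monro
    conditions (sum of rho_t diverges, sum of rho_t^2 converges);
    0.51 is aggressive: large early steps that decay slowly.
    """
    return (delay + t) ** (-forget)

rates = [learning_rate(t) for t in range(5)]
assert all(a > b for a, b in zip(rates, rates[1:]))   # monotone decreasing
assert abs(learning_rate(0) - 1.0) < 1e-12            # first step: full weight
```

The first-lap behavior discussed below (rapid downweighting of unused states) is a direct consequence of these large early step sizes.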
Note that under this setting, SVI reaches noticeably better objective scores than the fixed-truncation memoized algorithm on the chromatin experiments, but performs worse on the larger-scale motion capture experiments.

Initialization. Across all experiments, we used the same procedure to initialize algorithms given the provided data sequences and a specific number of clusters K. We call this procedure “random-contiguous-blocks”, since it selects subwindows of data sequences at random and uses these to create the global likelihood parameters (via the standard global step). We found this worked well with all datasets here. To generalize to a new dataset, though, it is often advantageous to make the window length longer or shorter than our default, since some datasets have much longer segments than others. Also, higher-dimensional datasets can benefit from using more data in initialization. Of course, if the dataset has very little self-transition (lots of fast-switching states), another type of initialization may be preferred.

6.6.1 Toy Data

In Fig. 6.2, we study 32 toy data sequences generated from 8 Gaussian states with sticky transitions. This simple toy dataset was suggested by previous work on scalable hidden Markov models (Foti et al., 2014) as a simple check of algorithm quality. From an abundant initialization with 50 states, the sampler and non-adaptive variational methods require hundreds of laps to remove redundant states, especially under a non-sticky model (\kappa = 0). In contrast, our adaptive methods reach the ideal of zero Hamming distance within a few dozen laps regardless of stickiness, suggesting less sensitivity to hyperparameters. This toy dataset has 32 sequences divided into B = 8 batches. Each sequence has length T = 1000. Each observation is a 2D real vector x_{nt} \in \mathbb{R}^2. We use a full-covariance Gaussian likelihood L and a corresponding Wishart distribution for the prior P.

¹ http://www.bitbucket.org/michaelchughes/bnpy-dev/
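The “random-contiguous-blocks” initializer described above might be sketched as follows (a loose reconstruction from the description; the window length, seeding, and use of window means are our assumptions, standing in for the full global step that would set the observation-model free parameters):

```python
import numpy as np

def init_random_contiguous_blocks(seqs, K, window=100, seed=0):
    """Pick K random contiguous subwindows of the data and use each
    window's empirical mean as an initial cluster location.

    seqs : list of (T_d, dim) arrays. Returns a (K, dim) array of
    initial observation-model locations (a stand-in for the full
    global step that would set tau-hat, nu-hat).
    """
    rng = np.random.default_rng(seed)
    dim = seqs[0].shape[1]
    centers = np.empty((K, dim))
    for k in range(K):
        d = rng.integers(len(seqs))          # random sequence
        T = seqs[d].shape[0]
        w = min(window, T)
        start = rng.integers(T - w + 1)      # random contiguous block
        centers[k] = seqs[d][start:start + w].mean(axis=0)
    return centers

# Toy usage: 4 constant sequences, 3 initial clusters
seqs = [np.ones((500, 2)) * i for i in range(4)]
centers = init_random_contiguous_blocks(seqs, K=3)
assert centers.shape == (3, 2)
```

Because each initial cluster summarizes one contiguous block, this initializer is naturally matched to data with sticky state persistence, as the text notes.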
When we show segmentations in Fig. 6.2, we always show segmentations for sequences 1, 3, 5, and 7, which together contain at least 50 timesteps representing each of the 8 true states. In Fig. 6.2, we show trace plots that compare the progress of various algorithms under two settings: non-sticky dynamics (\kappa = 0) and sticky dynamics (\kappa = 50). Comparing non-sticky and sticky models, we see that the sticky model generally encourages faster convergence for all algorithms. In particular, for the Gibbs sampler of Fox et al. (2011), the non-sticky sampler takes thousands of laps through the dataset for Hamming distance to drop near zero, but it does reach that configuration, as illustrated by the segmentations in the bottom row. In contrast, the sticky sampler reaches this ideal by around 200 laps according to the trace plots. Thus, the performance of the sampler can be quite sensitive to the value of the provided sticky hyperparameter. We thank an anonymous reviewer for suggesting this detailed analysis.

Note that across Fig. 6.2, in both sticky and non-sticky cases, our adaptive algorithms with birth/merge/delete proposals eliminate the redundant states more quickly than non-adaptive competitors. Our proposal moves enable fast convergence regardless of the hyperparameter values, suggesting that algorithms with greater power to escape local optima can avoid some of the sensitivity exhibited by more limited methods.

6.6.2 Speaker Diarization

We study 21 unrelated audio recordings of meetings with an unknown number of speakers from the NIST 2007 speaker diarization challenge (NIST, 2007). The sticky HDP-HMM previously achieved state-of-the-art diarization performance (Fox et al., 2011) using a sampler that required hours of computation. We ran methods from 10 matched initializations with 25 states and \kappa = 100, computing Hamming distance while excluding non-speech segments, as in the standard DER metric. Fig.
6.3 shows that within minutes, our algorithms consistently find segmentations better aligned to true speaker labels. There are N = 21 sequences, which have no overlap in terms of common speakers. Thus, we process each one independently. This makes “memoized” inference equivalent to full-dataset inference, because there is only one batch. It also makes stochastic inference irrelevant, so we only compare to the sampler as a baseline method. We use a full-covariance Gaussian likelihood with a corresponding Wishart prior. For Hamming distance computations on this dataset, we utilize the provided annotations of each sequence into “background” (non-speech) and “foreground” (speech) states. We only count timesteps labeled as foreground in the distance computation, and ignore any assignments to timesteps labeled background. Our dataset clearly marks background labels with negative integer labels, while foreground states have non-negative labels {0, 1, ...}.

6.6.3 Motion capture datasets

Labeled 6-sequence motion capture dataset. Next, we consider the task of unsupervised discovery of activity types from motion capture sensor data captured over time. Fox et al. (2014) presented a benchmark dataset with 6 total sequences, each labeled at every timestep as one of 12 exercise types, such as jogging and jumping-jacks. The raw sensor data as well as the labels provided by a human annotator are illustrated in Fig. 1.4. Each sequence has 12 joint angles (wrist, knee, etc.) captured at 0.1-second intervals. To model this data, we use a first-order autoregressive (AR) Gaussian likelihood with the corresponding matrix-normal-Wishart conjugate prior. For online algorithms, we process each of the 6 sequences as its own batch. Fig. 6.4 shows that non-adaptive methods struggle even when initialized abundantly with 30 (dashed lines) or 60 (solid) states, while our adaptive methods reach better values of the objective \mathcal{L} and cleaner many-to-one alignment to true exercises.
It seems the 30-state models are slightly preferred (especially for the sampler). However, for our adaptive models with deletes and merges (red curves) and with births (purple), the number of states in the initialization does not seem to matter much.

Large 124-sequence motion capture dataset. Next, we apply scalable methods to the 124-sequence dataset of Fox et al. (2014). Again, we use first-order autoregressive (AR) likelihoods with the corresponding conjugate priors. For this larger dataset, we process all 124 sequences as 20 distinct batches, each containing about 6 sequences. Fig. 6.5 shows deletes and merges making consistent reductions from abundant initializations, and births growing from K = 1. Fig. 6.5 also shows estimated segmentations for 10 representative sequences, along with skeleton illustrations for the 10 most-used states in this subset. These segmentations align well with held-out text descriptions of the raw sensor sequences, which are found on the source website. We lack ground-truth exercise labels at each timestep, but these whole-sequence labels provide a proxy for judging the resulting segmentation. Scripts for visualizing the skeleton trace of a specific data segment can be found in an open-source code repository available online².

For non-adaptive methods on the 124-sequence dataset, we compare each algorithm initialized from abundant initializations of 100 (dashed) and 200 (solid) states. For the stochastic method (SVI, yellow curves), under all initializations we see a rapid drop in the number of states used during the first lap. To explain this, remember that with 20 batches for 124 sequences, each batch will have around 6 sequences. From the segmentation figure, it is clear that state usage patterns vary widely across sequences, with each sequence using only a handful of states.
The aggressive learning rate we use in the first lap will tend to severely downweight any initial states not used in the first few 2 http://github.com/michaelchughes/mocap6dataset/ 157 batches, which explains the rapid drop. In contrast, the memoized method (blue) is designed to use global information for each parameter update, not just the current batch. We further enforce this by delaying the first global update until at least 50 sequences are seen. This makes the memoized results a large improvement on the stochastic results for this dataset. Furthermore, using delete and merge moves only (red) shows that we can reduce down to about 30 states and reach even higher levels of performance. Similarly, starting from 1 state with birth moves (purple), we can grow to nearly comparable levels of performance. We hope to answer why the purple curves do not quite reach the performance of the red curves in future work. Regardless, the set of adaptive methods reach high objective scores much more consistently than non-adaptive methods. 6.6.4 Chromatin epigenomic dataset Finally, we study segmenting the human genome by the appearance patterns of regulatory proteins Hoffman et al. (2012). This problem of chromatin segmentation was motivated earlier in Ch. 1 and illustrated in Fig. 1.5. We observe 41 binary signals from Ernst and Kellis (2010) at 200bp intervals throughout a white blood cell line (CD4T). Each binary value indicates the presence or absence of an acetylation or methylation that controls gene expression. Such binary observations are naturally explained by a Bernoulli likelihood with a conjugate Beta prior. We set our prior to Beta(0.1, 0.3), which favors clusters with extreme probabilities (nearly 0 or nearly 1). We divide the whole epigenome into 173 sequences (one per batch) with total size T = 15.4 million. Some efforts choose to process each chromosome as one very long sequence, but we elected to test the ability of our algorithms to handle many batches. 
To split up each chromosome, we searched for intervals with at least 50 consecutive all-zero observations, which are somewhat common. We picked division points in the middle of these empty segments, splitting each chromosome into sequences of more manageable size while avoiding artifacts at the start of each sequence as much as possible. In the end, we obtained 173 sequences, ranging in size from T = 10,000 to T = 200,000 timesteps. Fig. 6.6 shows our adaptive proposal optimization algorithms can grow from 1 state to 70 states and compete favorably with non-adaptive competitors even on genome-scale datasets. Even when fixed-truncation algorithms are given 100 initial clusters, they do not achieve ELBO scores as high as the nearly 75 clusters discovered by our method. Fig. 6.6 also demonstrates that parallel processing of distinct sequences in the local step can produce useful speed improvements. Using 64 cores working in parallel on a K = 50 cluster problem, we can complete the fixed-truncation local step 25× faster than a serial implementation. This lowers the time required to complete a local step on the entire dataset (173 sequences totaling 15 million observations) from over an hour to less than 2 minutes. This practical speed gain makes large-scale analysis with our HDP-HMM optimization algorithms possible.

6.7 Discussion

Extension of our previous methods to the HDP hidden Markov model required further technical innovation: a surrogate optimization problem for the sticky HDP, as well as adaptations of the proposal moves (birth, merge, delete) to sequential data. Among open issues, we note that the primary obstacle to scaling these methods further is the O(K^2 T_d) cost of the local step at each sequence d. This quadratic dependence on the number of active clusters K makes training with many clusters very slow even with the parallelization we have developed.
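This quadratic cost arises in the forward-backward recursions of the local step, where a length-K message is multiplied by a K × K transition matrix at every timestep. The following minimal numpy sketch of the forward pass (our own illustration with hypothetical names, not the thesis implementation) makes the O(K^2 T_d) scaling visible:

```python
import numpy as np

def forward_pass(log_lik, trans, init):
    """Normalized forward recursion for one sequence.

    log_lik : (T, K) log-likelihood of each timestep under each state
    trans   : (K, K) row-normalized transition probabilities
    init    : (K,)  initial state probabilities

    Returns the normalized forward messages (T, K) and the log evidence.
    The matrix-vector product at each timestep costs O(K^2), so one
    pass over a sequence of length T costs O(K^2 T).
    """
    T, K = log_lik.shape
    fwd = np.zeros((T, K))
    # initialize with the first timestep, shifting by the max for stability
    msg = init * np.exp(log_lik[0] - log_lik[0].max())
    log_ev = log_lik[0].max() + np.log(msg.sum())
    fwd[0] = msg / msg.sum()
    for t in range(1, T):
        # O(K^2) propagation of the previous message through the transitions
        msg = (fwd[t - 1] @ trans) * np.exp(log_lik[t] - log_lik[t].max())
        log_ev += log_lik[t].max() + np.log(msg.sum())
        fwd[t] = msg / msg.sum()
    return fwd, log_ev
```

The backward pass and the pairwise marginals needed for the transition statistics have the same O(K^2 T_d) structure, which is why the local step dominates runtime as K grows.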
Finding slightly different models or inference assumptions that improve this scaling to nearly linear in K could make it possible to train effectively on millions of sequences and thousands of hidden states or clusters.

Algorithm 6.2: Memoized variational algorithm for the HDP-HMM

Input:
  {x_d}_{d=1}^D : dataset with D sequences, each of length T_d
  K : truncation level
  {τ̂_k, ν̂_k}_{k=1}^K : initial global parameters of observation model
  {û_k, ω̂_k}_{k=1}^K : initial global parameters of allocation model defining q(u)
  {{θ̂_jk}_{k=1}^{K+1}}_{j=0}^K : initial global parameters defining Markov transition probabilities
  γ, α, κ : allocation model hyperparameters
  τ̄, ν̄ : observation model prior hyperparameters
Output:
  {τ̂_k, ν̂_k}_{k=1}^K : updated global parameters of observation model
  {û_k, ω̂_k}_{k=1}^K : updated global parameters of allocation model
  {{θ̂_jk}_{k=1}^{K+1}}_{j=0}^K : updated global parameters defining Markov transition probabilities

 1: function MemoizedCoordAscentForHDPHMM(x, K, τ̂, ν̂, û, ω̂, θ̂)
 2:   for lap l ∈ 1, 2, ... do
 3:     for batch b ∈ Shuffle({1, 2, ... B}) do
 4:       for d ∈ D_b do                                     ▷ Local step at batch b
 5:         ŝ_d, r̂_d ← FwdBwdAlgForSeq(C_d(τ̂, ν̂), P(θ̂))
 6:       S^G ← S^G − S_b                                    ▷ Decrement global statistics
 7:       N^G ← N^G − N_b
 8:       M^G ← M^G − M_b
 9:       for k ∈ 1, 2, ... K do                             ▷ Summary step at batch b
10:         S_bk ← Σ_{d ∈ D_b} Σ_{t=1}^{T_d} r̂_dtk s(x_dt)
11:         N_bk ← Σ_{d ∈ D_b} Σ_{t=1}^{T_d} r̂_dtk
12:         M_b0k ← Σ_{d ∈ D_b} r̂_d1k
13:         for j ∈ 1, 2, ... K do
14:           M_bjk ← Σ_{d ∈ D_b} Σ_{t=1}^{T_d} ŝ_dtjk
15:       S^G ← S^G + S_b                                    ▷ Increment global statistics
16:       N^G ← N^G + N_b
17:       M^G ← M^G + M_b
18:       for k ∈ 1, 2, ... K do                             ▷ Global step for observation parameters
19:         τ̂_k ← S_k + τ̄
20:         ν̂_k ← N_k + ν̄
21:       while not converged do
22:         for j ∈ 0, 1, 2, ... K do
23:           for k ∈ 1, 2, ... K+1 do                       ▷ Update global transition probabilities q(π_j)
24:             θ̂_jk ← M_jk + α_j π_k^G(û) + δ_jk κ
25:           for k ∈ 1, 2, ... K+1 do
26:             P_jk ← ψ(θ̂_jk) − ψ(Σ_{ℓ=1}^{K+1} θ̂_jℓ)
27:         û ← arg max_û f(û, ω̂, P(θ̂), α, γ, κ)            ▷ Update global top-level probabilities q(π^G)
28:   return τ̂, ν̂, û, ω̂, {θ̂_j}_{j=0}^K

Memoized coordinate ascent algorithm for approximate posterior inference for the HDP-HMM.

[Figure 6.2 plots omitted: trace panels for κ = 0 and κ = 50, plus segmentation panels annotated “sampler: K=10 after 2000 laps in 74 min.”, “sampler: K=8 after 5000 laps in 169 min.”, “delete,merge: K=8 after 100 laps in 5 min.”, and “stoch: K=47 after 2000 laps in 359 min.”]

Figure 6.2: Toy data HDP-HMM algorithm comparison, using sticky and non-sticky models. Algorithm comparison on a toy dataset, using a non-sticky state transition model with κ = 0 (top row) and a sticky model with κ = 50 (middle row). Left column: Objective function L as more training data is seen. Middle column: Number of effective states K (states with N_k ≥ 1) as more data is seen. Right column: Hamming distance between aligned segmentations and the ground-truth segmentation. All non-birth algorithms are initialized using a common set of over-complete segmentations, with either K = 50 (dashed lines) or K = 100 states (solid lines). The sticky model encourages faster convergence for all algorithms. The non-sticky sampler takes thousands of laps through the dataset for Hamming distance to drop near zero, but it does reach that configuration, as illustrated in the segmentations in the bottom rows.
In contrast, the sticky sampler reaches this ideal by around 200 laps according to the middle trace plots. Both sticky and non-sticky models clearly prefer the ideal segmentation, but algorithm convergence is sensitive to the sticky hyperparameter, especially for the fixed-truncation variational algorithms. Our adaptive methods, by comparison, reach ideal configurations within 20 laps regardless of whether the model is sticky or not.

[Figure 6.3 plots omitted: scatterplot of final Hamming distances, plus trace panels for Meeting 11 (best), Meeting 16 (avg.), and Meeting 21 (worst).]

Figure 6.3: Comparison of HDP-HMM algorithms on 21 speaker diarization sequences. Method comparison on speaker diarization from common K = 25 initializations (Sec. 6.6.2). Left: Scatterplot of final Hamming distance for our adaptive method and the sampler. Across 21 meetings (each with 10 initializations shown as individual dots), our method finds segmentations closer to ground truth. Right: Traces of objective L and Hamming distance for meetings representative of good, average, and poor performance.

[Figure 6.4 plots omitted: trace panels plus segmentation panels annotated “birth: Hdist=0.34 K=28 @ 100 laps”, “del/merge: Hdist=0.30 K=13 @ 100 laps”, and “sampler: Hdist=0.49 K=29 @ 1000 laps”.]

Figure 6.4: Comparison of HDP-HMM algorithms on 6 motion capture sequences. Comparison on 6 motion capture streams (Sec. 6.6.3). Top: Our adaptive methods reach better L values and lower distance from true exercise labels. Bottom: Segmentations from the best runs of birth/merge/delete (left), only deletes and merges from 30 initial states (middle), and the sampler (right). Each sequence shows true labels (top half) and estimates (bottom half), colored by the true state with highest overlap (many-to-one).

[Figure 6.5 plots omitted: objective and state-count traces; final segmentations of 10 sequences (1-1: playground jump; 1-2, 1-3: playground climb; 2-7: swordplay; 5-3, 5-4, 5-5: dance; 6-3, 6-4, 6-5: basketball dribble); and skeleton panels labeled Walk, Climb, Sword, Arms Swing, Dribble, Jump, Balance, Ballet Leap, and Ballet Pose.]

Figure 6.5: Comparison of HDP-HMM algorithms on 124 motion capture sequences. Study of 124 motion capture sequences (Sec. 6.6.3). Top Left: Objective L and state count K as more data is seen. Solid lines have 200 initial states; dashed, 100. Top Right: Final segmentation of 10 select sequences by our method, with id numbers and descriptions from mocap.cs.cmu.edu. The 10 most-used states are shown in color, the rest in gray. Bottom: Time-lapse skeletons assigned to each highlighted state.
[Figure 6.6 plots omitted: objective and state-count traces, plus wallclock time and speedup curves for 1 to 64 parallel workers.]

Figure 6.6: Comparison of HDP-HMM algorithms for chromatin segmentation of the human genome. Segmentation of the human epigenome: 15 million observations across 173 sequences (Sec. 6.6.4). Top Row: Adaptive runs started at 1 state grow to 70 states within one lap and reach better L scores than 100-state non-adaptive methods. Each run takes several days. Bottom Row: Wallclock times and speedup factors for a parallelized fixed-truncation local step on 1/3 of this dataset. 64 workers complete a local step with K = 50 states in under one minute.

Chapter 7

Sparse variational posteriors for cluster and topic assignments

In this chapter, our overall goal is to improve the scalability of training algorithms by reducing the runtime cost of the most expensive step in our earlier coordinate ascent optimization algorithms: the local step. Just as our earlier memoized algorithms helped us scale to large dataset sizes N, we now wish to improve scaling with larger numbers of clusters or topics K. Towards this end, we investigate alternative posterior approximations for the local assignment variables z in mixture models and topic models. Our previous efforts for these models in Ch. 3 and Ch. 5 used a conventional approximate posterior q(z) which explicitly represented the probability of each data token being assigned to each of the K active clusters. Here, we propose a constrained family of sparse variational distributions that allow at most L < K non-zero entries in the learned responsibility parameter vector for each local assignment, where the user-specified threshold L trades off speed for accuracy.
Setting L = K recovers the dense representation of our earlier conventional approaches, while the other extreme of setting L = 1 yields “hard” or “winner-take-all” assignments. We will show that moderate values such as L ≈ 4 can produce significant speed gains compared to dense L = K conventional representations when K ≫ 100, while still producing similar-quality heldout predictions. Our approach fits into any variational algorithm for parametric or nonparametric models, regardless of whether global parameters are inferred by point estimates (as in EM) or given full approximate posteriors. Furthermore, our approach easily integrates into existing frameworks for large-scale streaming data analysis (Hoffman et al., 2013; Broderick et al., 2013), and is easy to parallelize. At the time of publication, the work in this chapter is under review for a machine learning conference. To summarize, the core contributions are:

Contribution 1: New variational algorithm for mixture models with user-specified L-sparse assignments. Our approach is the first to use O(L) memory to represent the assignment posterior q(z_n) for data token n. Hard assignment algorithms with L = 1 have long been known in the literature, but no procedure to coherently optimize an L-sparse q(z_n) for 1 < L < K was previously known, to our knowledge. Algorithms like sparse EM (Neal and Hinton, 1998) allow the update to q(z_n) to process fewer than K indices of the corresponding responsibility vector during an update, but must represent all K entries in memory. In contrast, our approach requires less storage and less processing time, especially for computing summary statistics.

Contribution 2: New variational algorithm for topic models with user-specified L-sparse assignments. We present a similar algorithm for topic models, for which hard assignments via MCMC sampling had been the most prominent previous way to attain L = 1 sparsity in q(z_n) (Mimno et al., 2012).
Contribution 3: Experimental evidence that moderate L values offer a good speed-accuracy tradeoff. We often find it favorable to choose moderate values of L larger than the winner-take-all baseline but much smaller than the total number of active clusters K. Our later experiments show that the hard assignment condition L = 1 often leads to pathological behavior, especially in topic models.

7.1 Local step algorithms for L-sparse mixture models

In this section, we focus our attention on the local step of the coordinate ascent algorithm for the finite mixture model presented in Ch. 2. We begin by reviewing the conventional local step update that produces dense responsibility vectors, which was first presented in Sec. 2.6.1. Later, we introduce our new sparse variational approximate posterior q(z_n), which adds an additional constraint to the conventional dense formulation, and then derive the appropriate coordinate ascent update for this new approximation. We emphasize that while we have chosen to present the material here for the finite mixture model for simplicity, the parallels to the local step for DP mixture models in Ch. 3 are straightforward.

7.1.1 Local step with conventional dense responsibilities.

The local step of inference computes a new assignment vector r̂_n for each observation n which optimizes L while holding global parameters fixed. Under either the approximate posterior treatment of global parameters in Eq. (2.93) or the ML estimates of Eq. (2.73), the optimal updates have closed form, which we derive by expanding the expectations that define L and dropping terms independent of r̂_n. After simplification, we have

  L_n(x_n, r̂_n) = Σ_{k=1}^K [ r̂_nk W_nk(x_n) − r̂_nk log r̂_nk ],    (7.1)
  W_nk(x_n, θ̂, τ̂, ν̂) ≜ E_q[log π_k] + E_q[log p(x_n | φ_k)].

[Figure 7.1 plots omitted: panels “RespFromWeights step”, “Gaussian summary step”, and “Distance from dense L = K”.]

Figure 7.1: Speed and accuracy of L-sparse assignment posteriors for training mixture models. Impact of sparsity-level L on different substeps of estimating local assignments for a mixture model. L defines the number of non-zero entries in the posterior responsibility vector r̂_n for each observation n. For these experiments, we use a minibatch of N = 36,000 8x8 image patches and a pretrained mixture model with zero-mean Gaussian likelihood, as described in Sec. 7.2.1. Left: Comparison of methods from Alg. 2.3 for computing optimal responsibilities r̂_n given fixed log posterior weights W_nk for each cluster. When L = K we use DenseRespFromWeights, while for L < K we find TopLRespFromWeights is much faster. Center: Given fixed assignments r̂, the summary step computes the per-cluster sufficient statistics {N_k, S_k}_{k=1}^K defined in Eq. (7.3). Here, we measure the time required for this computation for our minibatch of 8x8 patch data, where the observation statistic s(x_n) = x_n x_n^T is 64x64. Right: Comparison of variational distance between optimal dense responsibilities and those from TopLRespFromWeights across various L values. We show the cumulative density function across all patches in our minibatch, using the pretrained K = 200 model published online by Zoran and Weiss (2012). Moderate values L ≈ 8 are almost indistinguishable from the dense solution.

We interpret W_nk ∈ R as the log posterior weight that cluster k has for observation n. Larger values imply that cluster k is more likely to be assigned to observation n. For ML or MAP learning, the expectations defining W_nk are replaced with point estimates.
Our goal is to find the cluster responsibility vector r̂_n that optimizes L_n in Eq. (7.1), subject to the constraint that the entries of r̂_n are non-negative and sum to one:

  r̂_n* = arg max_{r̂_n} L_n(x_n, r̂_n)   s.t.   r̂_n ≥ 0,  Σ_k r̂_nk = 1.    (7.2)

The optimal solution is simple: exponentiate each weight and then normalize the resulting vector. The function DenseRespFromWeights in Alg. 7.1 details the required steps. The runtime cost is O(K), dominated by K required evaluations of the exp function.

Summary statistics. Given fixed assignments r̂, the global step computes the optimal values of the global free parameters under L. Whether doing point estimation or approximate posterior inference, this update requires only two finite-dimensional sufficient statistics of r̂, rather than the complete r̂ matrix. For each cluster k, we must compute the expected count N_k ∈ R+ of its assigned observations and the expected data statistic vector S_k. These summary statistics were first introduced in Eq. (2.96), and are defined again here for convenience:

  N_k(r̂) = Σ_{n=1}^N r̂_nk,    S_k(x, r̂) = Σ_{n=1}^N r̂_nk s(x_n).    (7.3)

Algorithm 7.1: Update for dense responsibilities given log posterior weights for mixture model.

Input: [W_n1 ... W_nK] : log posterior weights
Output: [r̂_n1 ... r̂_nK] : responsibility values for each cluster
 1: function DenseRespFromWeights(W_n)
 2:   for k ∈ 1, ... K do
 3:     r̂_nk = e^{W_nk}
 4:   s_n = Σ_{k=1}^K r̂_nk
 5:   for k ∈ 1, ... K do
 6:     r̂_nk = r̂_nk / s_n
 7:   return r̂_n

DenseRespFromWeights is the conventional method for solving the optimization problem in Eq. (7.2). It delivers a dense vector r̂_n of K non-zero entries, satisfying the non-negativity and sum-to-one constraints. The runtime cost requires K evaluations of the exp function, K summations, and K divisions. For large K, this can be expensive, especially because the resulting vector r̂_n often has a vast majority of entries indistinguishable from zero.
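As a purely illustrative numpy translation of this dense update, together with the summary statistics of Eq. (7.3) (function names are ours, not those of the thesis codebase):

```python
import numpy as np

def dense_resp_from_weights(W):
    """Dense responsibilities from log posterior weights W of shape (N, K).

    Subtracting the per-row max before exponentiating is a standard guard
    against overflow and leaves the normalized result unchanged.
    """
    R = np.exp(W - W.max(axis=1, keepdims=True))
    return R / R.sum(axis=1, keepdims=True)

def dense_summary_stats(X, R):
    """Expected counts N_k and data statistics S_k from Eq. (7.3).

    Here the data statistic is simply s(x_n) = x_n, giving O(N K D) work;
    richer statistics like x_n x_n^T only increase the effective D.
    """
    N_k = R.sum(axis=0)   # (K,)  expected counts per cluster
    S_k = R.T @ X         # (K, D) expected data statistics per cluster
    return N_k, S_k
```

Every entry of the N × K responsibility matrix is touched by both steps, which is exactly the O(NK) and O(NKD) cost the text describes.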
Algorithm 7.2: Update for L-sparse responsibilities given log posterior weights for mixture model.

Input: [W_n1 ... W_nK] : log posterior weights; L : sparsity level
Output: {r̂_nℓ, i_nℓ}_{ℓ=1}^L : top L responsibility values and their cluster indices
 1: function TopLRespFromWeights(W_n, L)
 2:   i_n1, ... i_nL = SelectTopL(W_n)
 3:   for ℓ ∈ 1, ... L do
 4:     r̂_nℓ = e^{W_{n,i_nℓ}}
 5:   s_n = Σ_{ℓ=1}^L r̂_nℓ
 6:   for ℓ ∈ 1, ... L do
 7:     r̂_nℓ = r̂_nℓ / s_n
 8:   return r̂_n, i_n

TopLRespFromWeights optimizes the same objective as DenseRespFromWeights, subject to the additional constraint that at most L clusters can have non-zero posterior probability in r̂_n. The L-sparse optimization problem this procedure solves is in Eq. (7.4). To tractably solve this problem, an O(K) introspective selection algorithm (Musser, 1997) first identifies the indices of the L largest values of the weight vector. After this, we can find the optimum with L evaluations of the exp function, L summations, and L divisions. This cost can be substantially less than the cost of the dense procedure DenseRespFromWeights, as shown in Fig. 7.1.

These sums have cost linear in the number of observations N. The required work is O(NK) for the count vector and O(NKD) for the data vector, where D is the dimension of the data statistic vector: s(x_n) ∈ R^D.

7.1.2 Local step with L-sparse responsibilities.

To speed up the local steps above, we recognize that much of the runtime cost comes from representing r̂_n as a dense vector. Although there are K clusters, for any observation n only a few entries in r̂_n will have appreciable mass, while the vast majority are close to zero. We thus introduce an additional constraint to our optimization problem: that at most L entries are non-zero, where 1 ≤ L ≤ K. The formal problem is:

  r̂_n* = arg max_{r̂_n} L_n(x_n, r̂_n)    (7.4)
    s.t.  r̂_n ≥ 0,  Σ_{k=1}^K r̂_nk = 1,  Σ_{k=1}^K 1(r̂_nk > 0) ≤ L.

This constrained optimization problem has a simple solution, given by the function TopLRespFromWeights in Alg. 7.2.
First, we identify the indices of the top L values of the weight vector W_n in descending order. Let i_n1, ..., i_nL denote these top-ranked cluster indices, each one a distinct value in {1, 2, ..., K}. Given this active set of clusters, we simply exponentiate and normalize only at these indices:

  r̂*_nk = e^{W_nk} / Σ_{ℓ=1}^L e^{W_{n,i_nℓ}}   if k ∈ {i_n1, ..., i_nL},    (7.5)
  r̂*_nk = 0                                      otherwise.

As shown in Alg. 7.2, we can represent this solution as an L-sparse vector, with L real values r̂_n1, ..., r̂_nL and L integer indices i_n1, ..., i_nL. Solutions may not be unique if the posterior weights W_n contain duplicate values. We handle these ties arbitrarily, since swapping duplicate indices leaves the objective unchanged.

Proof of optimality. We offer a proof by contradiction that TopLRespFromWeights solves the optimization problem in Eq. (7.4). Suppose that r̂_n′ is optimal, but there exists a pair of clusters j, k such that j has larger weight but is not included in the active set while k is. This means W_nj > W_nk, but r̂′_nj = 0 and r̂′_nk > 0. Consider the alternative r̂_n*, which is equal to the vector r̂_n′ but with entries j and k swapped. After substituting into Eq. (7.1) and simplifying (the entropy term is unchanged by the swap), we find the objective function value increases under our alternative: L(x_n, r̂_n*) − L(x_n, r̂_n′) = r̂′_nk · (W_nj − W_nk) > 0. Thus, the optimal solution must include the largest L clusters by weight in its active set.

Runtime cost. Comparing DenseRespFromWeights and TopLRespFromWeights side-by-side yields pertinent insights into the runtime cost. The former requires K exponentiations, K additions, and K divisions to turn weights into responsibilities. In contrast, given the active set of cluster indices i_n, our procedure requires only L of each operation. Furthermore, finding the active indices i_n can be done in O(K) via selection algorithms. Thus, for L ≪ K we find TopLRespFromWeights to be much faster, as shown empirically in Fig. 7.1.
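In numpy, argpartition provides the same O(K) partial-selection behavior as C++'s nth_element, so a sketch of the L-sparse update for a single observation might look like this (names are our own illustration):

```python
import numpy as np

def top_l_resp_from_weights(w, L):
    """L-sparse responsibilities for one observation, matching Eq. (7.5).

    w : (K,) log posterior weights.
    Returns the L chosen cluster indices and the responsibilities
    renormalized over just those indices.
    """
    # argpartition places the L largest entries in the last L slots in O(K),
    # without fully sorting the array (like C++ nth_element)
    idx = np.argpartition(w, len(w) - L)[-L:]
    # exponentiate and normalize only the selected entries,
    # shifting by the max for numerical stability
    r = np.exp(w[idx] - w[idx].max())
    return idx, r / r.sum()
```

As in the text, only L exponentiations, additions, and divisions follow the O(K) selection, and the returned pair (indices, values) is the entire O(L) storage for this observation.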
Selection algorithms (Blum et al., 1973; Musser, 1997) are designed to find the top L values in descending order within an array of size K. These methods use divide-and-conquer strategies to recursively partition the input array into two blocks, one with values above a pivot and the other below. Musser (1997) introduced a selection procedure which uses introspection to smartly choose pivot values and thus guarantee O(K) worst-case runtime. This procedure is implemented within the C++ standard library as nth_element, which we use for SelectTopL in practice. This function operates in-place on the provided array, rearranging its values so that the first L entries are all bigger than the remainder. Importantly, there is no internal sorting within either partition. We use nth_element to find the top indices, not values, by defining a custom comparator.

Advantages of L-sparse responsibilities. The sparsity parameter L provides a practitioner with a natural way to trade off between execution speed and training accuracy. When L = K, we recover the original problem in Eq. (7.2), while L = 1 leads to assigning each observation to exactly one cluster, as in k-means or maximization-expectation (ME, Kurihara and Welling (2009)) algorithms. We call this L = 1 case “hard” or “winner-take-all” assignment. Our focus is on modest values of 1 < L ≪ K. As shown in Fig. 7.1, when the number of clusters K measures in the hundreds or larger, with moderate L values we find that TopLRespFromWeights is significantly faster than DenseRespFromWeights. We find that the dense method's required K exponentiations dominate the cost of the O(K) introspective selection procedure. With L-sparse responsibilities, computing the summary statistics in Eq. (7.3) scales linearly with L rather than K. This gain is noticeable for Gaussian mixture models with unknown covariances, which require a D × D sufficient statistic s(x_n) = x_n x_n^T. Fig.
7.1 shows that the cost of a typical minibatch summary step for image patch modeling, using a published model with K = 200 total clusters, drops from over 10 seconds for the dense L = K procedure to less than a second for L = 4. In addition to the speed gains in Fig. 7.1, we also expect that sparsity may improve the overall convergence rate of training across many global and local steps. Finally, we hope that L-sparse responsibilities are more interpretable due to the lack of small-but-non-zero values.

7.1.3 Related work.

Using “hard” cluster assignments instead of “soft” probabilities is a well-known way to trade accuracy for speed. K-means and its Bayesian nonparametric extension DP-means (Kulis and Jordan, 2012) justify L = 1 sparsity via small-variance asymptotics. Viterbi training (Juang and Rabiner, 1990) and maximization-expectation algorithms (Kurihara and Welling, 2009) both use L = 1 hard assignments. However, we expect L = 1 to be too coarse for many applications. Few published methods allow any tuning of the sparsity-level L. Neal and Hinton (1998) introduced sparse EM, a method intended for datasets where all local parameters fit into main memory. The algorithm maintains a dense parameter vector r̂_n for each observation n, but only edits a subset of this vector during each local step. The edited subset may consist of the L largest entries or all entries above some threshold. Any inactive entries are “frozen” to small but non-zero values, and newly edited entries are normalized such that the whole vector r̂_n preserves its sum-to-one constraint. This approach is effective for small datasets (Ng and McLachlan, 2004), but the large memory requirement limits scalability. In contrast, our approach stores only L values at each observation. More recently, several efforts have used MCMC samplers to approximate the local step within a larger variational algorithm (Mimno et al., 2012; Wang and Blei, 2012a).
They estimate an approximate assignment posterior by averaging over many samples, where each sample is an L = 1 hard assignment. The number of finite samples S is a crucial choice which balances accuracy and speed. We think selecting an L value via our approach is more intuitive than choosing an S for the same problem. Furthermore, our method provides an exact, monotonically increasing way to optimize the objective L, while sampling improves it only in expectation.

[Figure 7.2 plots omitted: heldout log-likelihood vs. training time for K = 200, 400, 800, and 1600, comparing memoized and stochastic runs at L = 1, 4, 16, and K.]

Figure 7.2: L-sparse mixture model results on millions of image patches. Impact of sparsity on training a zero-mean Gaussian mixture model on 3.6 million 8x8 pixel image patches using stochastic and memoized variational inference. Training data comes from 400 total images processed 4 images at a time. Each panel shows heldout prediction scores over time for several training runs at a fixed number of clusters K. In all cases, runs with small L values match or exceed the prediction quality of the dense baseline L = K in far less time. Moderate L = 4 or L = 16 can outperform L = 1 hard assignments, as shown in the K = 200 and K = 1600 panels.

7.1.4 Integration with scalable and adaptive proposal algorithms

Our proposed L-sparse assignment algorithm for mixture models easily fits into the scalable variational algorithms we have discussed in detail in previous chapters. Both stochastic variational (Hoffman et al., 2013) and memoized variational algorithms (Hughes and Sudderth, 2013) from Ch. 3 can use TopLRespFromWeights as a drop-in replacement for DenseRespFromWeights without any other required changes, except for a summary step which takes advantage of the sparsity to gain speed.
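That sparsity-aware summary step only touches the L stored indices per observation, so a batch's statistics cost O(NLD) rather than O(NKD). A sketch under the same illustrative assumptions as before (s(x_n) = x_n, names our own):

```python
import numpy as np

def sparse_summary_stats(X, resp_vals, resp_idx, K):
    """Summary statistics of Eq. (7.3) from L-sparse responsibilities.

    X         : (N, D) data, with data statistic s(x_n) = x_n for simplicity
    resp_vals : (N, L) the non-zero responsibilities for each observation
    resp_idx  : (N, L) the cluster indices holding those responsibilities
    Only N*L scatter-add updates are performed instead of N*K.
    """
    N, D = X.shape
    N_k = np.zeros(K)
    S_k = np.zeros((K, D))
    # unbuffered scatter-add handles repeated cluster indices correctly
    np.add.at(N_k, resp_idx, resp_vals)
    np.add.at(S_k, resp_idx.ravel(),
              (resp_vals[:, :, None] * X[:, None, :]).reshape(-1, D))
    return N_k, S_k
```

Because the result is identical to the dense accumulation over the same (sparse) responsibilities, this slots into either the stochastic or the memoized global step without further changes.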
7.2 Experimental results with L-sparse mixture models

7.2.1 Mixture models for image patches

We now consider applying mixture models to natural images, inspired by Zoran and Weiss (2012). We train a model for 8x8 image patches taken from overlapping regular grids of stride 4 pixels. Each observation is a vector x_n ∈ R^64, preprocessed to remove its mean. We then apply a mixture model with concentration α = 10, using a zero-mean, full-covariance Gaussian likelihood function with a corresponding conjugate Wishart prior. To evaluate, we track the log-likelihood score of heldout observations x′ under our trained model:

  score(x′_n) = log Σ_{k=1}^K π̂_k N(x′_n | 0, Σ̂_k).    (7.6)

Here, π̂_k = E_q[π_k] and Σ̂_k = E_q[Σ_k] are point estimates computed from our trained global parameters using standard formulas. The function N is the probability density function of a multivariate normal. Fig. 7.2 compares stochastic and memoized implementations of our algorithm on 3.6 million patches from 400 images. The algorithms process 100 minibatches, each with N = 36,816 patches. First, we confirm clear speed gains. With K = 800 clusters, we can process all 3.6 million patches in about 8100 seconds with L = 4, while the dense procedure takes 27000 seconds (over 7 hours). Second, these speed gains do not sacrifice prediction quality. Across many values of K, we see that sparse runs with L < K reach heldout scores similar to the dense baselines, and often much better values, as shown in the K = 800 and K = 1600 panels.

7.3 Local step algorithms for L-sparse topic models

Next, we apply our sparse approximate posterior idea to topic models. Again, for simplicity we focus on a finite topic model known as latent Dirichlet allocation (Blei et al., 2003), but extensions to our Bayesian nonparametric HDP topic model from Ch. 5 are quite straightforward.
7.3.1 Mean field for the LDA Topic Model

Our sparsity-level constraint naturally applies to topic models, which are hierarchical mixtures applied to data from D documents $x_1, \ldots, x_D$ containing words from a finite vocabulary of size V. The Latent Dirichlet Allocation (LDA) topic model (Blei et al., 2003) generates a document's observations from a mixture model with common topics $\{\phi_k\}_{k=1}^K$ but document-specific frequencies $\pi_d$. This is the finite version of the infinite model presented in Fig. 5.1, though with a different, non-hierarchical prior for $p(\pi_d)$.

Each topic $\phi_k \sim \mathrm{Dir}_V(\bar{\lambda})$, where $\phi_{kv}$ is the probability of type v under topic k. The document-specific frequencies $\pi_d$ are drawn from a symmetric Dirichlet $\mathrm{Dir}_K(\frac{\alpha}{K}, \ldots, \frac{\alpha}{K})$, where α > 0 is a scalar. Tokens are generated:

$$z_{dn} \sim \mathrm{Cat}_K(\pi_d), \qquad x_{dn} \sim \mathrm{Cat}_V(\phi_{z_{dn}}). \quad (7.7)$$

The goal of posterior inference is to estimate the common topics as well as the frequencies and assignments in any document. The standard mean-field approximate posterior over these quantities is specified by:

$$q(z_d) = \prod_{n=1}^{N_d} \mathrm{Cat}_K(z_{dn} \mid \hat{r}_{dn1}, \ldots, \hat{r}_{dnK}), \quad (7.8)$$
$$q(\pi_d) = \mathrm{Dir}_K(\pi_d \mid \hat{\theta}_{d1}, \ldots, \hat{\theta}_{dK}),$$
$$q(\phi) = \prod_{k=1}^{K} \mathrm{Dir}_V(\phi_k \mid \hat{\lambda}_{k1}, \ldots, \hat{\lambda}_{kV}).$$

Under this factorized approximate posterior, we can again set up a variational optimization objective:

$$\mathcal{L}(x, \hat{r}, \hat{\theta}, \hat{\lambda}) = \log p(x) - \mathrm{KL}(q \,\|\, p) = \mathbb{E}_q[\log p(x, z, \pi, \phi) - \log q(z, \pi, \phi)]. \quad (7.9)$$

7.3.2 Local step of LDA Training Algorithm

Here, we derive an iterative update algorithm for estimating the assignment factor $q(z_d)$ and the frequencies factor $q(\pi_d)$ for a document d. Alg. 7.3 gives the conventional local step algorithm, while Alg. 7.4 gives our novel algorithm for L-sparse assignment posteriors.
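The generative process in Eq. (7.7) can be sketched in a few lines of numpy (a toy illustration with our own function names, not part of the thesis software):

```python
import numpy as np

def generate_lda_document(topics, alpha, n_tokens, rng):
    """Toy sketch of the LDA generative process in Eq. (7.7).
    topics : (K, V) matrix whose rows are topic-word distributions phi_k.
    alpha  : scalar concentration for the symmetric Dirichlet prior."""
    K, V = topics.shape
    # Document-specific frequencies: pi_d ~ Dir(alpha/K, ..., alpha/K)
    pi_d = rng.dirichlet(np.full(K, alpha / K))
    # Tokens: z_dn ~ Cat_K(pi_d), x_dn ~ Cat_V(phi_{z_dn})
    z = rng.choice(K, size=n_tokens, p=pi_d)
    x = np.array([rng.choice(V, p=topics[k]) for k in z])
    return z, x
```

Posterior inference reverses this process: given only the tokens x, we estimate the factors over z, π, and φ in Eq. (7.8).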
Figure 7.3: Speed and accuracy of L-sparse assignment posteriors for training topic models. Influence of sparsity-level L on the different substeps of training a topic model from one minibatch of 1000 NY Times articles, for L = 1, 2, 4, 8, 16, and K, across K from 0 to 1500 topics. L defines the number of non-zero entries in the assignment vector $\hat{r}_{dn}$ for each observation n in document d. Panel a: Timings for the local step of the LDA topic model run for a maximum of 100 iterations at each document. With K = 500 topics, moderate sparsity L = 8 gives a throughput of over 100 documents per second, while the dense approach can process only 30 documents per second. Panel b: Same as a, but with additional restart proposals. Panel c: Timings for the summary step of the topic model, which computes the effective count of each word type in each topic. Even though this step takes little time, our sparse representations can speed it up. Panel d: Define the empirical topic distribution of document d by normalizing the count vector $[N_{d1} \ldots N_{dK}]$. For each document in the minibatch, we compute the variational distance between the empirical distribution produced by TopLRespForDoc at a given L and that produced by DenseRespForDoc. This plot shows the empirical CDF of this distance across all 1000 documents.

Updating $\hat{\theta}_d$, which defines $q(\pi_d)$. We have a closed-form update for the document-topic pseudocounts $\hat{\theta}_d$ given fixed assignments $\hat{r}_{dn}$:

$$\hat{\theta}_{dk} = N_{dk}(\hat{r}_d) + \alpha/K. \quad (7.10)$$

Here, $N_{dk} \triangleq \sum_{n=1}^{N_d} \hat{r}_{dnk}$ counts the number of tokens assigned to topic k in document d.

Updating $\hat{r}_{dn}$, which defines $q(z_{dn})$.
Under the usual dense representation, the optimal update for the assignment vector of token n has a closed form like the mixture model, but with document-specific weights:

$$\hat{r}_{dn} \leftarrow \mathrm{RespFromWeights}([W_{dn1} \ldots W_{dnK}]), \quad (7.11)$$
$$W_{dnk}(x_{dn}, \hat{\theta}, \hat{\lambda}) \triangleq \mathbb{E}_q[\log \pi_{dk} + \log \phi_{k x_{dn}}],$$
$$\mathbb{E}_q[\log \pi_{dk}] = \psi(\hat{\theta}_{dk}) - \psi\Big(\sum_{\ell=1}^{K} \hat{\theta}_{d\ell}\Big).$$

We can easily incorporate our sparsity-level constraint to enforce at most L non-zero entries in $\hat{r}_{dn}$; TopLRespFromWeights still provides the optimal solution in this case.

Sharing parameters by word type. Naively, tracking the assignments for document d requires explicitly representing a separate K-dimensional distribution for each of the $N_d$ tokens. However, we can save memory and runtime by recognizing that for a token with word type v, the optimal value of Eq. (7.11) will be the same for all tokens in the document with the same type. We can thus share parameters with no loss in representational power, requiring only $U_d$ separate K-dimensional distributions, where $\hat{r}_{dn} \triangleq \hat{r}_{d u_{dn}}$ and $u_{dn}$ indexes the unique word type of token n.

Iterative joint update for $q(\pi_d)$ and dense $q(z_d)$. When visiting a new document, we must infer both $q(\pi_d \mid \hat{\theta}_d)$ and $q(z_d \mid \hat{r}_d)$. Following standard practice for dense assignments, we use a block-coordinate ascent algorithm that iteratively alternates between Eq. (7.10) and Eq. (7.11), as shown in Alg. 7.3. When computing the log posterior weights $W_{duk}$, two easy speed-ups are possible. First, we need only evaluate $C_{vk} = \mathbb{E}_q[\log \phi_{kv}]$ once for each word type v and topic k, and reuse the value across iterations. Second, we can directly compute the effective log prior probability $P_{dk} \triangleq \mathbb{E}_q[\log \pi_{dk}]$ during iterations, and instantiate $\hat{\theta}$ only after the algorithm converges.

To initialize the update cycle for a document, we recommend visiting each token n and updating it with initial weight $W'_{dnk} = \mathbb{E}_q[\log \phi_{k x_{dn}}]$. This lets the topic-word likelihoods drive the initial assignments. We then alternate between Eq. (7.11) and Eq.
(7.10) until either a maximum number of iterations is reached (typically 100) or the maximum change of all document-topic counts $N_{dk}$ falls below a threshold (typically 0.05). Each iteration updates $P_{dk}$ with cost O(K), and then performs $U_d$ evaluations of the token-specific responsibility update RespFromWeights, each with dense cost O(K). On most datasets, we find local iterations are the dominant computational cost, exceeding 90% of the runtime.

Iterative joint update with sparsity. Under the constraint that each responsibility vector has at most L non-zero entries, we can use an alternative iterative algorithm we call TopLRespForDoc, given in Alg. 7.4. This procedure has several advantages over DenseRespForDoc. First, we have already shown in Fig. 7.1 that the cost of the subprocedure TopLRespFromWeights is much less than RespFromWeights. Second, we further assume that once a topic's mass $N_{dk}$ decays near zero, it will never rise again. With this assumption, at every iteration we identify the set of active topics (those with non-negligible mass) in the document: $A_d \triangleq \{k : N_{dk} > \epsilon\}$. Only these topics will have weight large enough to be chosen in the top L for any token. Thus, throughout TopLRespForDoc we need only loop over the active set, and each iteration costs $O(|A_d|)$ instead of O(K).

Discarding topics whose mass within a document drops below $\epsilon$ is justified by previous empirical observations of the so-called "digamma problem" described in Mimno et al. (2012): for topics with negligible mass, the expected log probability term becomes vanishingly small. For example, $\psi(\alpha/K) \approx -200$ for $\alpha \approx 0.5$ and $K \approx 100$, and this value decreases further as K increases. In practice, after the first few iterations the active set stabilizes and each token's top L topics rarely change while the relative responsibilities continue to improve. In this regime, we can amortize the cost of TopLRespForDoc by avoiding selection altogether, instead just reweighting each token's current set of top L topics.
We perform selection for the first 5 iterations and then only every 10 iterations, which yields large speedups without loss in solution quality.

Fig. 7.3 compares the runtime of TopLRespForDoc across values of sparsity-level L against a heavily-optimized version of DenseRespForDoc, which precomputes the exponentiated conditional likelihoods and uses optimized dense-matrix multiplication routines. For small numbers of topics K < 100, our gains are modest at best. However, for K ≫ 100 our sparsified algorithm can process 1000 documents at least three times faster, with the relative speedup increasing with K.

Restart proposals. Our previous work in Sec. 5.3.4 (Hughes et al., 2015a) introduced restart proposals, a post-processing step for single-document inference that can dramatically improve performance. Given the output count vector $N_d$ from Alg. 7.3, the restart proposal constructs a candidate $N'_d$ by setting an active entry of $N_d$ to zero and then running two iterations forward. We then accept the new count vector if it improves the objective L. We find these proposals are crucial for escaping local optima, so we always include them. A side-by-side comparison of costs with and without restarts appears in Fig. 7.3.

7.4 Experimental results with L-sparse topic models

7.4.1 Topic Modeling Experiments

We train topic models at several sparsity-levels using our Memoized and Stochastic algorithms. We compare to the fast-yet-exact SparseGibbs (sometimes called SparseLDA) sampler of Yao et al. (2009) and the sampler-inside-stochastic GibbsInSVI method of Mimno et al. (2012). These external methods use Java code available in Mallet (McCallum, 2002). External methods use the default Mallet initialization, while we use AnchorWords (Arora et al., 2013) to initialize q(φ).

Sparsity Hyperparameters. Our primary goal is quantifying how our performance varies with the sparsity-level L.
GibbsInSVI allows control of the number of samples S used to approximate

Algorithm 7.3 Update for document-specific responsibilities under standard topic model
Input:
  α : document-topic smoothing scalar
  $\{\{C_{vk}\}_{k=1}^{K}\}_{v=1}^{V}$ : log prob. of word v in topic k, $C_{vk} \triangleq \mathbb{E}_q[\log \phi_{kv}]$
  $\{v_{du}, c_{du}\}_{u=1}^{U}$ : word type/count pairs for doc. d
Output:
  $\hat{r}_d$ : dense responsibilities for doc d
1: function DenseRespForDoc(C, α, $v_d$, $c_d$)
2:   for u = 1, . . . , U do
3:     $\hat{r}_{du}$ = RespFromWeights($C_{v_{du}}$)
4:   while not converged do
5:     for k = 1, 2, . . . , K do
6:       $N_{dk} = \sum_u c_{du} \hat{r}_{duk}$,  $P_{dk} = \psi(N_{dk} + \alpha/K)$
7:     for u = 1, 2, . . . , U do
8:       for k = 1, 2, . . . , K do
9:         $W_{duk} = C_{v_{du} k} + P_{dk}$
10:      $\hat{r}_{du}$ = RespFromWeights($W_{du}$)    ⊲ See Alg. 7.1
11:  return $\hat{r}_d$

Algorithm for computing responsibilities for a single document given fixed topics, for a multinomial likelihood and bag-of-words data. Without sparsity constraints, each step scales with O(K).

Algorithm 7.4 Update for document-specific responsibilities under L-sparse topic model
Input:
  α : document-topic smoothing scalar
  $\{\{C_{vk}\}_{k=1}^{K}\}_{v=1}^{V}$ : log prob. of word v in topic k, $C_{vk} \triangleq \mathbb{E}_q[\log \phi_{kv}]$
  $\{v_{du}, c_{du}\}_{u=1}^{U}$ : word type/count pairs for doc. d
Output:
  $\hat{r}_d$ : L-sparse responsibilities for doc d
1: function TopLRespForDoc(C, α, $v_d$, $c_d$, L)
2:   for u = 1, . . . , U do
3:     $\hat{r}_{du}$ = TopLRespFromWeights($C_{v_{du}}$, L)    ⊲ See Alg. 7.2
4:   while not converged do
5:     for $k \in A_d$ do
6:       $N_{dk} = \sum_{u=1}^{U} c_{du} \hat{r}_{duk}$
7:     $A_d = \{k \in A_d : N_{dk} > \epsilon\}$
8:     for $k \in A_d$ do
9:       $P_{dk} = \psi(N_{dk} + \alpha/K)$
10:    for u = 1, 2, . . . , U do
11:      for $k \in A_d$ do
12:        $W_{duk} = C_{v_{du} k} + P_{dk}$
13:      $\hat{r}_{du}$ = TopLRespFromWeights($W_{du}$, L)
14:  return $\hat{r}_d$

Algorithm for computing responsibilities for a single document given fixed topics, for a multinomial likelihood and bag-of-words data. Accomplishes the same goal as DenseRespForDoc but forces each observation to use at most L topics. This change, plus tracking only the active topics $A_d$ in a document, leads to much faster iterations than the conventional DenseRespForDoc.
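A simplified numpy sketch of the L-sparse local step in Alg. 7.4 follows. The array layout and function names are ours (not the BNPy implementation), the selection-amortization trick from the text is omitted, and a stdlib numerical stand-in replaces the digamma function:

```python
import numpy as np
from math import lgamma

def _digamma(x, h=1e-6):
    # Numerical psi(x) via centered difference of log-gamma; a stdlib
    # stand-in for scipy.special.digamma, adequate for this sketch.
    return (lgamma(x + h) - lgamma(x - h)) / (2 * h)

def top_l_resp_for_doc(C_d, counts, alpha, L, eps=1e-8, max_iters=100, tol=0.05):
    """Sketch of TopLRespForDoc (Alg. 7.4).
    C_d    : (U, K) array, C_d[u, k] = E_q[log phi_{k, v_du}] for the
             U unique word types in document d.
    counts : (U,) array of token counts c_du per unique type.
    Returns the (U, K) L-sparse responsibility matrix."""
    U, K = C_d.shape

    def top_l(weights, L):
        # Zero out all but the L largest weights, renormalize the rest.
        n = len(weights)
        L = min(L, n)
        r = np.zeros(n)
        top = np.argpartition(weights, n - L)[n - L:]
        w = np.exp(weights[top] - weights[top].max())
        r[top] = w / w.sum()
        return r

    # Likelihood-first initialization: weights are just the C_d values.
    r_hat = np.vstack([top_l(C_d[u], L) for u in range(U)])
    N = counts @ r_hat  # effective topic counts N_dk
    active = np.arange(K)

    for _ in range(max_iters):
        # Prune topics whose in-document mass has decayed to near zero.
        active = active[N[active] > eps]
        # P_dk = psi(N_dk + alpha/K); the psi(sum) term is constant in k
        # and cancels when responsibilities are normalized.
        P = np.array([_digamma(n_k + alpha / K) for n_k in N[active]])
        r_hat = np.zeros((U, K))
        for u in range(U):
            r_hat[u, active] = top_l(C_d[u, active] + P, L)
        N_new = counts @ r_hat
        converged = np.max(np.abs(N_new - N)) < tol
        N = N_new
        if converged:
            break
    return r_hat
```

Each iteration touches only the active topic set, so its cost shrinks as topics are pruned, mirroring the O(|A_d|) claim in the text.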
Figure 7.4: Comparison of values for sparsity-level L on topic models. Analysis of 1392 NIPS articles (left) and 7961 Wikipedia articles (right), each at K = 400 topics. These trace plots of heldout likelihood over time represent an internal comparison of different possible L values (L = 1, 4, 16, K) for memoized inference. We find that hard L = 1 assignments can plateau early or fail catastrophically due to the many local optima in multinomial topic models. In contrast, moderate L values like 4 or 16 produce the same high-quality predictions as dense L = K representations in a fraction of the time (note the log scale of the x-axis).

$q(z_d)$. We consider S ∈ {5, 10}, always discarding half of these samples as burn-in. SparseGibbs has no sparsity parameter.

Nuisance Hyperparameters. For all methods, we set document-topic smoothing α = 0.5 and topic-word smoothing $\bar{\lambda} = 0.1$. We set the stochastic learning rate at iteration t to $\rho_t = (\delta + t)^{-\kappa}$. We use grid search to find the best heldout score on validation data, considering delay δ ∈ {1, 10} and decay κ ∈ {0.55, 0.65}.

We evaluate all methods on two key metrics: wallclock time and predictive power. Following Wang et al. (2011), we measure heldout performance via a document completion task. Given a heldout document $x_d$, we divide its words at random by type into two pieces: 80% in $x_d^A$ and 20% in $x_d^B$. We use subset A to estimate the document-specific probabilities $\hat{\pi}_d$, and then evaluate the predictions of this estimate on the remaining words in B. Throughout, we point estimate each topic k to the trained posterior mean $\hat{\phi}_k = \mathbb{E}_q[\phi_k]$.
Across many heldout documents, we measure the following score:

$$\mathrm{score}(x^A, x^B, \hat{\phi}) = \frac{\sum_d \sum_{n=1}^{|x_d^B|} \log \sum_k \hat{\pi}_{dk} \hat{\phi}_{k x_{dn}^B}}{\sum_d |x_d^B|}$$

For all algorithms, we fix a point estimate of topics $\hat{\phi}$ from training and then estimate $\hat{\pi}_d$ in the same way: finding the optimal $q(\pi_d^A)$ and $q(z_d^A)$ for the words in the first piece $x_d^A$ by using DenseRespForDoc. Finally, we take $\hat{\pi}_d = \mathbb{E}_q[\pi_d^A]$ and compute the heldout likelihood of $x_d^B$.

We first conduct careful comparisons on datasets which are small enough for all algorithms to make many complete laps (effective passes through the dataset). The first two rows of Fig. 7.5 show performance on 1392 NIPS articles and 7961 Wikipedia articles. The final row shows performance for the much larger New York Times Annotated Corpus: 1.8 million articles from 1987 to 2007, with 9000 documents per batch (200 batches total). Across all datasets, we reach the same conclusions:

Figure 7.5: L-sparse topic model results on NIPS, Wikipedia, and NYTimes. Analysis of 1392 NIPS articles (top row), 7961 Wikipedia articles (middle), and 1.8 million
New York Times articles (bottom). We use 5 batches for the small datasets, and 200 batches for NY Times. Each panel plots heldout scores (higher is better) over time for sparse and dense versions of our memoized and stochastic algorithms and the external baselines SparseGibbs and GibbsInSVI. The number of clusters K varies from left to right.

Moderate sparsity is best. Throughout Fig. 7.5, we see that runs with sparsity-levels of L = 4, L = 8, or L = 16 under both memoized and stochastic algorithms converge several times faster than L = K runs, yet yield indistinguishable predictions.

Hard L = 1 assignments can be pathological. As shown in Fig. 7.4, running memoized inference with L = 1 may either plateau early at noticeably worse performance (e.g., the left panel using the NIPS dataset) or fall into progressively worse local optima (e.g., the right panel using the Wiki dataset). Remember, neither memoized nor stochastic inference for LDA topic models has a monotonicity guarantee, because both $q(z_d)$ and $q(\pi_d)$ are re-estimated from scratch each time we visit a document. With L = 1 hard assignments, the local step can have a strong tendency to use many different topics to explain a document given a likelihood-first initialization of the local coordinate ascent iterations. This can explain the pathological behavior of L = 1 assignments.

External methods converge slowly. Throughout Fig. 7.5, no run of SparseGibbs or GibbsInSVI reaches competitive predictions in the allowed time limit (3 hours for NIPS and Wiki, 2 days for NYTimes). GibbsInSVI benefits from more samples (S = 10 vs. S = 5) only on NYTimes; more than 10 samples did not improve performance further. Mimno et al. (2012) also found diminishing returns from tuning the number of MCMC samples S used to estimate expectations.
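The document completion score defined above can be sketched as follows (a hypothetical helper with our own names and array layout, assuming the A-half fits of $\hat{\pi}_d$ have already been computed):

```python
import numpy as np

def document_completion_score(pi_hat, phi_hat, heldout_B):
    """Per-token heldout log-likelihood for the document completion task.
    pi_hat    : (D, K) document-topic estimates fit on each doc's A-half.
    phi_hat   : (K, V) point-estimated topic-word probabilities.
    heldout_B : list of D integer arrays of word ids in each doc's B-half."""
    total_loglik, total_tokens = 0.0, 0
    for d, words_B in enumerate(heldout_B):
        # p(word w | doc d) = sum_k pi_dk * phi_kw, for each B-half token
        probs = pi_hat[d] @ phi_hat[:, words_B]
        total_loglik += np.log(probs).sum()
        total_tokens += len(words_B)
    # Normalize by the total number of B-half tokens across documents
    return total_loglik / total_tokens
```

Higher (less negative) values indicate better predictions, matching the y-axes of Figs. 7.4 and 7.5.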
7.5 Discussion

In this chapter, we have introduced a simple sparsity constraint on the parameters of a variational posterior which leads to faster training times, equal or better predictive power, and more intuitive interpretation. Our algorithms can be dropped into any ML, MAP, or full-posterior variational clustering objective and are simple to parallelize and scale to millions of examples.

We anticipate further research in adapting L > 1 sparsity to sequential models like HMMs, which might lead to dynamic programming algorithms that scale quadratically with L rather than K. This would speed up the largest existing bottleneck in using the training algorithms from Ch. 6.

We also anticipate applying these ideas to other factors in the approximate posterior. For example, for multinomial topic models we might directly constrain the shape parameters of $q(\phi_k)$ so that at most W vocabulary words have non-zero probability. We might also encode some relaxation of an anchor-word assumption (Arora et al., 2013). These kinds of constraints could improve interpretability as well as speed.

Chapter 8

Recommendations

Throughout this thesis, we have presented a unified framework for training Bayesian nonparametric probabilistic models via variational optimization algorithms. In this chapter, we discuss four possible extensions to our work which would make it applicable to a much broader family of modeling scenarios and improve overall effectiveness. We review each recommendation for future work below, with further detailed discussion in later sections.

Sec. 8.1: Parallelization for extreme scalability. Scaling our optimization algorithms up to billions of examples presents several technical challenges. Several of our baseline algorithms for fixed-truncation inference are "embarrassingly" parallelizable.
Creating parallelized implementations of birth, merge, and delete proposal moves remains an engineering challenge, though the general task seems feasible. We hope that parallel implementations, together with possible feed-forward approximations to the local step, provide a path to reliable clustering on huge datasets that can still adequately solve the model selection problem.

Sec. 8.2: Approximation-quality guarantees for variational optimization. Both the Bregman divergence k-means++ initialization algorithm in Alg. 2.2 (Ackermann and Blömer, 2010) and the original k-means++ algorithm of Arthur and Vassilvitskii (2007) offer quite good approximation-quality guarantees relative to the globally-optimal clustering. Spectral methods have also recently drawn attention as provable methods for point estimation in related clustering problems (Hsu et al., 2012). To our knowledge, no existing work examines whether these guarantees would apply to the type of variational optimization objectives we explore in this thesis. Providing such a guarantee for either some initialization procedure, or more ideally for some proposal procedures, could make our optimization approach much more attractive.

Sec. 8.3: Semi-supervised clustering. Many clustering applications come with some additional side information, such as the five-star ratings that accompany Yelp reviews or the low-level and high-level action labels from motion capture datasets. We envision coherent methods for training large-scale BNP models from the union of small curated datasets with supervised labels and millions of unlabeled data observations.

Sec. 8.4: Extension to probabilistic programming. The adaptive-proposal optimization methods we have presented for mixture, topic, and hidden Markov models should be extensible to a broad family of Bayesian nonparametric clustering models.
We envision a model specification language for this family as well as automatic inference in the spirit of probabilistic programming. We hope our practical BNPy software could evolve into this system, supporting a narrower set of models than alternatives like Stan (Carpenter et al., 2015) but offering scalable and reliable inference for this specialized family.

8.1 Parallelization and other tricks for extreme scalability

Two possible ways to speed up our variational optimization for Bayesian nonparametric clustering models for scaling to billions of examples are finding faster algorithms (in terms of order-of-growth runtime as well as practical speed) and deploying these algorithms on parallelized hardware. For example, the L-sparse approximations from Ch. 7 led to much faster algorithms by reducing some substeps from O(K) runtime cost to O(L). We first discuss parallelized implementation of our existing algorithms, identifying bottlenecks and opportunities. We later mention some approximations we can make to the local step of optimization which may yield substantial speed improvements.

Taking advantage of parallelization is a must for an algorithm to gain traction on industry-scale applications. We have had preliminary success scaling our existing fixed-truncation algorithms for topic models and hidden Markov models to the single-machine, multiple-core setting. Fig. 6.6 shows over 25x speed improvement when using 64 cores on data at the scale of the human genome. We are excited about performing adaptive birth, merge, and delete proposal inference in a truly distributed setting, with dozens of parallel workers each creating new clusters from disjoint batches and integrating them into a coherent centralized model. Some early work on merge proposals in this multiprocessor setting has been done by Campbell et al. (2015) for an alternative variational approach to HDP topic models.
Implementing such parallelization into BNPy would make our clustering methods accessible to many industry-scale applications.

Procedurally, the bottleneck of inference across all our models is the runtime cost of the "local" step, where we assign data to clusters. Rosenbaum and Weiss (2015) studied the image patch modeling application from Fig. 1.2 and showed that applying a pre-trained Gaussian mixture model to test images can be dramatically accelerated by a feed-forward approximation of the local step. Other recent approaches (Mnih and Gregor, 2014) exploit the fast, feed-forward properties of neural networks during training itself, optimizing the feed-forward network as part of the Bayesian variational optimization problem. By training a feed-forward network to approximate the local posterior $q(z_n)$, as done by Gan et al. (2015), these methods use information from previous examples to cluster new data faster. With some of these tricks, our memoized algorithm could train models with thousands of clusters and billions of examples.

8.2 Approximation guarantees for variational optimization

8.2.1 Distance-biased random initializations with guarantees

Our Bregman k-means++ initialization in Alg. 2.2 is motivated by the k-means++ algorithm of Arthur and Vassilvitskii (2007), with the extension to Bregman divergences motivated by Theorem 1 of Ackermann and Blömer (2010). Formally, Ackermann and Blömer (2010) prove that if a dataset satisfies basic separability conditions and the Bregman divergence function can be bounded by a Mahalanobis distance, then a k-means++ initialization yields a constant-factor approximation to the optimal solution quality with probability $2^{-O(K)}$. Some of this work appears in an earlier conference paper (Ackermann and Blömer, 2009). Developing extensions that justify this procedure for all Bregman divergences, not just those for which the Mahalanobis distance bound applies, is of great interest. Acharyya et al.
(2013) show how to generalize any Bregman divergence to a "general symmetric" divergence function that is symmetric and obeys the triangle inequality. They prove k-means++ style guarantees for this class of "general symmetric" divergences. This generalization voids the interpretation of our procedure in Alg. 2.2 as solving a MAP estimation problem for exponential families. However, such general properties may prove useful for achieving formal guarantees.

More scalable guarantees. The k-means++ initialization requires K total passes through the full dataset, or whatever subset is used for initialization. In each pass, exactly one additional cluster mean is chosen. Each step is conditioned on the results of the previous pass, so the procedure is not easily parallelized. When the number of clusters K is large, in the hundreds or thousands, this can be quite expensive. Bahmani et al. (2012) present an alternative initialization algorithm with similar performance guarantees but much cheaper initialization. Their procedure requires only a small constant number (≈ 5) of passes through the dataset in practice, instead of K total passes. The basic idea is that we commit to doing R passes, and in each round we select each data observation $x_n$ as a possible cluster center with probability

$$p_n \propto L \, D(x_n, \mu_n), \quad (8.1)$$

where L is a user-defined positive scalar and $\mu_n$ is the currently-selected center closest to $x_n$. After all R rounds, the total number of chosen clusters will be a random variable, though its expected value can be controlled by the parameter L as detailed in Bahmani et al. (2012). If exactly K clusters are needed, we can apply post-processing heuristics to select them.

This sampling technique is easy to generalize to our Bregman divergence case in practice. Showing that the guarantees still apply requires sophisticated theoretical arguments to get around the triangle inequality.

8.2.2 Provable spectral algorithms

Another prominent line of recent research is so-called spectral methods.
These approaches exist for finite mixture models, topic models (Arora et al., 2013), hidden Markov models (Hsu et al., 2012), and probabilistic context-free grammars (Cohen et al., 2014). The primary advantage of these methods is that they deliver very fast point estimates of global parameters. Often such methods come with formal guarantees of solution quality, but these require two assumptions. First, they assume infinitely many training examples. Second, they assume that the data is truly generated by the underlying likelihood model. Guarantees like those in Arora et al. (2013) do not exploit regularization from the prior, nor give any quality guarantees when the model is an approximation of the true generating process. Nevertheless, finding ways to more closely connect these methods with our Bayesian nonparametric approach could yield promising theoretical results, and we have found such procedures useful as initializations for our methods in practice.

8.3 Extensions for semi-supervised clustering

In many applications, supervised information like the number of stars accompanying an online review or the number of links to some Wikipedia article may be of interest. Often, supervised information is expensive to obtain, so we seek approaches that can easily handle joint learning of latent cluster representations when only a subset of the data has observed response variables. This can include cases where every data token has a response we'd like to predict, or cases where groups of tokens have a single response (such as labels for whole documents or entire sequences). Throughout this section, we'll make the discussion concrete by focusing on a particular example: prediction of document-level labels via an HDP topic model. Here, each document d has one observed response variable $y_d$. The response $y_d$ may be binary, multi-class, or real-valued in practice, though each choice requires slightly different inference machinery.
We'll take it to be real-valued throughout this running example, which implies our intended task is regression rather than classification. Our goal is to find clusters or topics that predict both the observed data $x_d$ and this document-level response variable $y_d$.

We consider two conceptual approaches for achieving our goal of finding latent clusters that predict response variables. The first conceptual approach comes from prediction-focused generative models. This tradition treats response variables like any other random variable in the graphical model. For our target task of document-level regression, the most prominent example is sLDA, supervised Latent Dirichlet Allocation (Mcauliffe and Blei, 2008), which models the response variable $y_d$ as a child of the topic indicators $z_d$. Extensions have modified this parametric model into a proper BNP model via the HDP (Zhang et al., 2013) using fixed-truncation variational inference.

The second conceptual tradition is maximum entropy discrimination (MED). This approach takes a more purely discriminative view that integrates topic models with the machinery behind support vector methods, which are widely used for classification and regression. MED methods define specific margin constraints which a proposed cluster-to-response prediction model must satisfy. For the topic modeling problem, these constraints force the hidden clusters to predict the response $y_d$ within some tolerance, and encourage big changes to the topic assignments of any documents that violate these constraints. Zhu et al. (2012) proposed a supervised topic model in this tradition called MED-LDA, inspired by Jebara's earlier work on more general prediction problems (Jebara, 2001). The original MED-LDA used variational inference, while recent efforts have developed data-augmented Gibbs samplers for parametric topic models (Zhu et al., 2013) and Bayesian nonparametric HMMs (Zhang et al., 2014).
Below, we show how both the sLDA and MED-LDA variational approaches are special cases of a loss-augmented version of our original variational objective L. This view allows us to compare and contrast the two approaches more closely.

Alternative upstream generative approaches are not considered here. These methods do not align with our focus on predictive performance, but still pursue label-informed clustering for text data (Mimno and McCallum, 2008; Lacoste-Julien et al., 2009), relational models (Kim et al., 2012), and visual scene categorization (Li et al., 2009). Upstream models tell a generative story in which labels $y_d$ are the parents of topic assignments $z_d$, rather than the opposite story told by downstream models like sLDA. Simply put, these models are not set up for predicting labels. In fact, under these models both imputing and marginalizing away missing labels can be computationally difficult. Nevertheless, these models offer interesting ways to inform latent representations when metadata is available, and could be a direction for future work.

Possible variational optimization objectives for supervision. Adding supervision to the topic model from Ch. 5 requires specifying the prediction rule for producing the response variable $y_d$ from the assignments $z_d$. Following Mcauliffe and Blei (2008), we choose a simple linear prediction rule given the approximate posterior $q(z_{dt})$ at each token t in document d, parameterized by responsibility vector $\hat{r}_{dt}$:

$$y_d = w^T \mathbb{E}[\bar{z}_d], \quad \text{where} \quad \mathbb{E}[\bar{z}_d] \triangleq \frac{1}{T_d} \sum_{t=1}^{T_d} [\hat{r}_{dt1} \; \hat{r}_{dt2} \; \ldots \; \hat{r}_{dtK}]. \quad (8.2)$$

Here, $\bar{z}_d$ is a K-length vector of empirical topic frequencies in document d. Each vector $\bar{z}_d$ has positive entries that sum to one. The parameter w describes a linear transformation from the vector $\bar{z}_d$ to the scalar $y_d$, where w is a K-length vector of weight coefficients.
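The prediction rule in Eq. (8.2) is a one-liner in numpy (a sketch with our own names):

```python
import numpy as np

def linear_prediction(resp_d, w):
    """Sketch of the linear prediction rule in Eq. (8.2).
    resp_d : (T_d, K) responsibilities r_hat for the tokens of document d.
    w      : (K,) regression weight vector.
    Returns the predicted real-valued response y_d = w^T E[zbar_d]."""
    zbar_d = resp_d.mean(axis=0)  # empirical topic frequencies E[zbar_d]
    return float(w @ zbar_d)
```

Because each row of `resp_d` sums to one, `zbar_d` is itself a probability vector over the K topics.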
Extending our general unsupervised objective L for topic models from earlier chapters to incorporate this prediction model for response variables yd, we arrive at the following general form:

Ls(w, r̂, θ̂, τ̂, ν̂, û, ω̂) = Lalloc(r̂, θ̂, û, ω̂) + Ldata(x, r̂, τ̂, ν̂) + R(w) − Σ_{d=1}^{D} Loss(yd, wᵀE[z̄d])    (8.3)

We still seek to maximize this general supervised objective Ls by finding the ideal free parameters r̂, θ̂, τ̂, ν̂, û, ω̂ as before. However, by incorporating supervised labels this optimization now requires finding cluster indicators that explain the data x and minimize the loss in predicting the responses yd at each document via the cluster indicators. Now, specifying our objective requires both a concrete loss function Loss and a choice of regularization R on the weight vector. Both sLDA and MED-LDA are special cases of this objective, each with a specialized choice of loss function.

sLDA objective. Prediction-focused generative models specify a loss function for the response variable yd. When yd is real-valued, a natural model is a simple Gaussian yd ∼ N(wᵀE[z̄d], λ⁻¹), which implies the following form of the loss function:

LosssLDA(yd, wᵀE[z̄d]) ≜ −log p(yd | w, zd) = (λ/2)(yd − wᵀE[z̄d])²    (8.4)

Here, λ > 0 is a scalar precision: larger values imply more precise estimates of yd. The major problem with this objective is that it forces learned clusters to explain both the observed data xd and the response yd, with no clear encoding of how to manage the tradeoff. When the number of observed tokens Td in a document is large, this can be problematic. The learned topics must provide good explanations for Td tokens and predict yd with precision λ. Without carefully setting λ, the clusters can easily favor configurations that explain the tokens well but not the response, since the objective weights tokens over response predictions by a factor of (roughly) Td/λ.
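To make the role of λ concrete, here is a minimal sketch of the sLDA loss in Eq. (8.4) (a hypothetical helper name, dropping the additive constant of the Gaussian log density):

```python
def loss_slda(y_d, y_hat, lam):
    """Gaussian negative log-likelihood of Eq. (8.4), up to a constant.

    lam is the scalar precision lambda; larger lam penalizes errors on
    y_d more heavily relative to the Td token-likelihood terms.
    """
    return 0.5 * lam * (y_d - y_hat) ** 2

# A one-unit prediction error with precision lam = 4 costs 2.0,
# while a perfect prediction costs nothing regardless of lam.
```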
Thus, tuning the parameter λ requires careful attention for good performance.

MED-LDA objective. For regression, MED-LDA employs an ε-insensitive loss function:

LossMED-LDA(yd, wᵀE[z̄d]) ≜ C max(0, |yd − wᵀE[z̄d]| − ε)    (8.5)

Here, ε is a free parameter that specifies tolerance to error. Setting ε = 0.1 on the Yelp example would encode that errors of less than 0.1 on the 5-star scale should not be penalized. No such free parameter exists in the sLDA objective. C > 0 is a scalar cost parameter. Like λ above, it must be tuned carefully during training. We can alternatively express the unconstrained objective Ls for MED-LDA as a constrained optimization problem, with a constraint for every document requiring that |yd − wᵀE[z̄d]| ≤ ε + ξd. This constraint forces the predictions to lie within tolerance ε. Slack variables ξd allow some predictions to exceed this desired tolerance, since a perfect model may not be possible, but penalize this transgression via increased loss. Comparing the MED-LDA objective with sLDA, we see that the MED-LDA approach more naturally lends itself to the goal of strong predictions. Both MED-LDA free parameters ε and C are directly interpreted as a desired tolerance and a penalty for violating this tolerance. In particular, this conceptual view has no trouble with tuning C differently as more data arrives. However, with sLDA the idea of tuning λ to match predictive performance runs contrary to the generative approach. Furthermore, the generalization performance associated with the MED machinery brings further potential benefits. Inference for the MED-LDA objective above is quite tractable. The updates to all our earlier variational free parameters do not change at all, except for the cluster indicator parameters r̂d. Finding the optimal weight vector w can be seen as the solution to a quadratic program when the function R(w) encodes an L2 penalty. Numerical solvers for this kind of QP are well-studied.
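The ε-insensitive loss of Eq. (8.5) is simple to state directly; the sketch below (a hypothetical helper name) returns zero whenever the prediction lands within the tolerance and grows linearly with cost C beyond it:

```python
def loss_medlda(y_d, y_hat, C=1.0, eps=0.1):
    """Epsilon-insensitive loss of Eq. (8.5).

    Predictions within eps of the true response incur zero loss;
    larger errors are penalized linearly with cost parameter C.
    """
    return C * max(0.0, abs(y_d - y_hat) - eps)
```

On the 5-star Yelp example with eps = 0.1, predicting 4.05 for a true rating of 4.0 incurs zero loss, while an error of a full star is penalized in proportion to C.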
For improved regularization, we may seek free parameters for a proper variational distribution q(w), rather than a simple point estimate of w. This is the approach taken by Zhu et al. (2012) and will be essential for properly comparing models of different truncation levels.

Figure 8.1: A probabilistic programming language for specifying clustering models. Our compositional view of clustering models. Col. A: Generative model for one data token xn. Col. B: Possible dependency graphs for cluster probability vectors π: DP (top), HDP (middle), and dependent DPs (bottom). Col. C: Possible graphs for cluster indicators z.

The experiments of Zhu et al. (2012) clearly indicate the need for a method that can adapt the truncation level. Several experiments studying predictive performance as a function of the number of topics show unsurprisingly that using too few topics can severely handicap predictions, while too many topics can cause slightly worse predictions while requiring more computation. Methods which can reliably explore the combinatorial space of clusterings via non-local changes will be quite important for finding interpretable models that have good predictive power.

8.4 Extensions with probabilistic programming

A prominent trend in modern machine learning research is the development of powerful, general-purpose representation languages called probabilistic programming languages (PPLs). These languages allow the user to specify complex probabilistic models as a generative process, and then automatically perform inference for the specified model given data.
PPLs such as Stan (Carpenter et al., 2015), WebPPL (Goodman and Stuhlmüller, 2014), and Venture (Mansinghka et al., 2014) vary in the domain of possible models but generally use a version of MCMC sampling or particle methods to perform inference. CrossCat (Mansinghka et al., 2015) is a recent example of using probabilistic programming for Bayesian nonparametric models of data tables. We have a vision for a PPL specific to a broad family of clustering models which includes mixtures, topic models, and hidden Markov models, as well as many more possibilities.

8.4.1 A preliminary model specification language

Our preliminary model specification language is illustrated in Fig. 8.1. As in the models discussed in this thesis, the unifying property of all models our language supports is that each data token xn is generated from a single cluster indicated by discrete variable zn ∈ {1, 2, . . ., K, . . .}. The chosen cluster zn = k specifies the observation model for data xn. If zn = k, we draw xn from an exponential family (EF) likelihood density with natural parameter φk: p(xn | φk) ∝ exp[s(xn)ᵀφk], where s(xn) is a sufficient statistic for the observation model, as detailed in Ch. 2.

Specifying an allocation model. The allocation of cluster assignments requires a set of frequency vectors π and indicators z. The number of assignment variables zn is exactly equal to the number of data tokens. The number of πj variables varies from model to model, with each vector πj a non-negative vector that sums to unity, with an entry for each cluster. By choosing a fixed graph structure for π and z, we encode structural assumptions into the model, as shown in Fig. 8.1. Each assignment zn is drawn from a frequency distribution over clusters defined by exactly one πj node.
The value of the parent of node zn in the z graph determines which πj variable is used to generate zn:

p(zn = k | zpa(n) = j) = πjk    (8.6)

This requires each zn variable to have exactly one parent, so the topology of the z graph must be restricted to trees (or multiple trees). However, the π graph may be any directed acyclic graph. Our framework defines a single allocation model by combining fixed graph structures for π, z from columns B and C. Each model can be either parametric or nonparametric, based on the prior distribution at the top level of the π graph. The pair (B1, C1) yields mixture models (Blei and Jordan, 2006), while (B2, C2) gives topic models (Blei et al., 2003; Teh et al., 2006), and (B2, C3) gives hidden Markov models (Beal et al., 2001). The pair (B2, C4) yields hidden Markov trees used for multi-scale image modeling (Crouse et al., 1998; Kivinen et al., 2007) and text parsing (Finkel et al., 2007; Liang et al., 2007). B3 and C2 could yield a topic model where frequencies vary over time, as in (Blei and Lafferty, 2006). This framework also extends to relational block models (Airoldi et al., 2009; Kemp et al., 2006), hierarchical hidden Markov models (Heller et al., 2009), and spatial models for image segmentation (Sudderth and Jordan, 2009).

8.4.2 Towards general-purpose inference

As a long-term goal, we wish to develop more general-purpose inference methods. Our current code base requires custom implementation of each allocation model (combination of π and z graphs), which makes reusing pieces more difficult than we would like. Therefore, we will investigate developing general-purpose inference algorithms that are modular with respect to the compositional structure of Fig. 8.1, developing a message-passing framework that can handle general π graphs and z trees. One considerable challenge here is developing scalable and adaptive inference methods that are more general purpose.
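To illustrate how Eq. (8.6) unifies these allocation models, the following sketch (a hypothetical encoding for illustration, not our actual specification language) performs ancestral sampling of indicators over any tree-structured z graph, given a table of πj vectors indexed by the parent's value; the example instantiates the HMM case (pair B2, C3), where each zn's parent is its predecessor in the sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_assignments(parents, pi, root_pi):
    """Ancestral sampling of cluster indicators z via Eq. (8.6).

    parents : list where parents[n] is the index of z_n's parent in the
              z tree, or -1 for a root node (hypothetical encoding).
    pi      : (J, K) array; row j is the frequency vector pi_j used when
              the parent takes value j (e.g. an HMM transition matrix).
    root_pi : (K,) frequency vector used for root nodes.
    """
    z = [None] * len(parents)
    for n, pa in enumerate(parents):  # assumes parents precede children
        probs = root_pi if pa == -1 else pi[z[pa]]
        z[n] = int(rng.choice(len(probs), p=probs))
    return z

# Length-4 Markov chain: each z_n's parent is z_{n-1}
parents = [-1, 0, 1, 2]
pi = np.array([[0.9, 0.1],   # sticky 2-state transitions
               [0.1, 0.9]])
z = sample_assignments(parents, pi, root_pi=np.array([0.5, 0.5]))
```

Swapping in a different `parents` topology (all tokens sharing one document-level parent, or a tree of image blocks) recovers the mixture, topic-model, and hidden-Markov-tree cases without changing the sampling rule.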
For example, we might wish to handle a desired likelihood p(xn | φk) that does not belong to the exponential family. Toward this end, general-purpose “black-box” variational methods (Kucukelbir et al., 2015; Ranganath et al., 2014) could inspire new solutions when combined with the adaptive proposals from this work. There are also possible connections to recent work on variational auto-encoders (Kingma and Welling, 2014). These methods remain possible even for models for which the usual mean-field variational objective function is intractable to compute. We are optimistic that with significant research investment, we could develop reliable and scalable general-purpose inference for a wide range of possible Bayesian nonparametric clustering models.

Bibliography

S. Acharyya, A. Banerjee, and D. Boley. Bregman divergences and triangle inequality. In SIAM International Conference on Data Mining, 2013.

M. R. Ackermann and J. Blömer. Coresets and approximate clustering for Bregman divergences. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’09), 2009.

M. R. Ackermann and J. Blömer. Bregman clustering for separable instances. In Proceedings of the 12th Scandinavian Conference on Algorithm Theory (SWAT), 2010.

A. Agarwal and H. Daumé III. A geometric view of conjugate priors. Machine Learning, 81(1):99–113, 2010.

E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. In Neural Information Processing Systems, 2009.

C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.

P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.

S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu. A practical algorithm for topic modeling with provable guarantees.
In International Conference on Machine Learning, 2013.

D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, 2007.

B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622–633, 2012.

A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

M. J. Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, University of London, 2003.

M. J. Beal and Z. Ghahramani. Variational inference for Bayesian mixtures of factor analysers. In Neural Information Processing Systems, 1999.

M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In Neural Information Processing Systems, 2001.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143, 2006.

D. M. Blei and J. D. Lafferty. Dynamic topic models. In International Conference on Machine Learning, 2006.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448–461, 1973.

C. A. Bouman and M. Shapiro. A multiscale random field model for Bayesian image segmentation. IEEE Transactions on Image Processing, 3(2):162–177, 1994.

T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In Neural Information Processing Systems, 2013.

M. Bryant and E. B. Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Neural Information Processing Systems, 2012.
T. Campbell, J. Straub, J. W. Fisher III, and J. P. How. Streaming, massively parallel variational inference for Bayesian nonparametrics. In Neural Information Processing Systems, 2015.

B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: a probabilistic programming language. Journal of Statistical Software, 2015.

G. Casella and C. P. Robert. Rao-Blackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.

J. Chang and J. W. Fisher III. Parallel sampling of DP mixture models using sub-cluster splits. In Neural Information Processing Systems, 2013.

J. Chang and J. W. Fisher III. Parallel sampling of HDPs using sub-cluster splits. In Neural Information Processing Systems, 2014.

S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. Spectral learning of latent-variable PCFGs: Algorithms and sample complexity. Journal of Machine Learning Research, 15(1):2399–2449, 2014.

M. S. Crouse, R. D. Nowak, and R. G. Baraniuk. Wavelet-based statistical signal processing using hidden Markov models. IEEE Transactions on Signal Processing, 46(4):886–902, 1998.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, pages 1–38, 1977.

J. Ernst and M. Kellis. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature Biotechnology, 28(8):817–825, 2010.

M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.

B. S. Everitt. Finite Mixture Distributions. Wiley, 1981.

T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 1973.

J. R. Finkel, T. Grenager, and C. D. Manning. The infinite tree. In Proc. of the Annual Meeting of the Association for Computational Linguistics, 2007.

N.
Foti, J. Xu, D. Laird, and E. Fox. Stochastic variational inference for hidden Markov models. In Neural Information Processing Systems, 2014.

E. B. Fox. Bayesian Nonparametric Learning of Complex Dynamical Phenomena. PhD thesis, Massachusetts Institute of Technology, 2009.

E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. A sticky HDP-HMM with application to speaker diarization. Annals of Applied Statistics, 5(2A):1020–1056, 2011.

E. B. Fox, M. C. Hughes, E. B. Sudderth, and M. I. Jordan. Joint modeling of multiple time series via the beta process with application to motion capture segmentation. Annals of Applied Statistics, 8(3):1281–1313, 2014.

Z. Gan, C. Li, R. Henao, D. Carlson, and L. Carin. Deep temporal sigmoid belief networks for sequence modeling. In Neural Information Processing Systems, 2015.

A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. CRC Press, 2013.

N. D. Goodman and A. Stuhlmüller. The Design and Implementation of Probabilistic Programming Languages. http://dippl.org, 2014. Accessed: 2016-3-18.

T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Neural Information Processing Systems, 2007.

T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 2004.

K. A. Heller, Y. W. Teh, and D. Görür. Infinite hierarchical hidden Markov models. In Artificial Intelligence and Statistics, 2009.

M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1), 2013.

M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent Dirichlet allocation. In Neural Information Processing Systems, 2010.

M. M. Hoffman, O. J. Buske, J. Wang, Z. Weng, J. A. Bilmes, and W. S. Noble. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods, 9(5):473–476, 2012.

D. Hsu, S. M.
Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.

M. C. Hughes and E. B. Sudderth. Memoized online variational inference for Dirichlet process mixture models. In Neural Information Processing Systems, 2013.

M. C. Hughes, D. I. Kim, and E. B. Sudderth. Reliable and scalable variational inference for the hierarchical Dirichlet process. In Artificial Intelligence and Statistics, 2015a.

M. C. Hughes, W. T. Stephenson, and E. B. Sudderth. Scalable adaptation of state complexity for nonparametric hidden Markov models. In Neural Information Processing Systems, 2015b.

H. Ishwaran and M. Zarepour. Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30(2):269–283, 2002.

A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.

S. Jain and R. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1):158–182, 2004.

T. Jebara. Discriminative, generative and imitative learning. PhD thesis, Massachusetts Institute of Technology, 2001.

W. H. Jefferys and J. O. Berger. Ockham’s razor and Bayesian analysis. American Scientist, 80(1):64–72, 1992.

K. Jiang, B. Kulis, and M. I. Jordan. Small-variance asymptotics for exponential family Dirichlet process mixture models. In Neural Information Processing Systems, 2012.

M. J. Johnson and A. S. Willsky. Stochastic variational inference for Bayesian time series models. In International Conference on Machine Learning, 2014.

M. I. Jordan. Graphical models. Statistical Science, 19(1):140–155, 2004.

B.-H. Juang and L. R. Rabiner. The segmental k-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 38(9):1639–1641, 1990.

C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda.
Learning systems of concepts with an infinite relational model. In AAAI Conference on Artificial Intelligence, 2006.

D. I. Kim, M. Hughes, and E. Sudderth. The nonparametric metadata dependent relational model. In International Conference on Machine Learning, 2012.

D. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

J. J. Kivinen, E. B. Sudderth, and M. I. Jordan. Learning multiscale representations of natural scenes using Dirichlet processes. In International Conference on Computer Vision, 2007.

A. Kucukelbir, R. Ranganath, A. Gelman, and D. M. Blei. Automatic variational inference in Stan. In Neural Information Processing Systems, 2015.

B. Kulis and M. I. Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In International Conference on Machine Learning, 2012.

K. Kurihara and M. Welling. Bayesian k-means as a “maximization-expectation” algorithm. Neural Computation, 21(4):1145–1172, 2009.

K. Kurihara, M. Welling, and N. Vlassis. Accelerated variational Dirichlet process mixtures. In Neural Information Processing Systems, 2006.

K. Kurihara, M. Welling, and Y. W. Teh. Collapsed variational Dirichlet process mixture models. International Joint Conference on Artificial Intelligence, 2007.

S. Lacoste-Julien, F. Sha, and M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Neural Information Processing Systems, 2009.

S. L. Lauritzen. Graphical Models. Clarendon Press, 1996.

L.-J. Li, R. Socher, and F.-F. Li. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In IEEE Conf. on Computer Vision and Pattern Recognition, 2009.

P. Liang, S. Petrov, M. I. Jordan, and D. Klein. The infinite PCFG using hierarchical Dirichlet processes. In Empirical Methods in Natural Language Processing, 2007.

S. P. Lloyd. Least squares quantization in PCM.
IEEE Transactions on Information Theory, 28(2):129–137, 1982.

D. J. C. MacKay. Ensemble learning for hidden Markov models. Technical report, Department of Physics, University of Cambridge, 1997.

D. J. C. MacKay. Choice of basis for Laplace approximation. Machine Learning, 33(1), 1998.

V. Mansinghka, D. Selsam, and Y. Perov. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099, 2014.

V. Mansinghka, P. Shafto, E. Jonas, C. Petschulat, M. Gasner, and J. B. Tenenbaum. CrossCat: A fully Bayesian nonparametric method for analyzing heterogeneous, high dimensional data. arXiv preprint arXiv:1512.01272, 2015.

J. D. Mcauliffe and D. M. Blei. Supervised topic models. In Neural Information Processing Systems, 2008.

A. K. McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.

D. Mimno and A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Uncertainty in Artificial Intelligence, 2008.

D. Mimno, M. Hoffman, and D. Blei. Sparse stochastic inference for latent Dirichlet allocation. In International Conference on Machine Learning, 2012.

D. Mimno, D. M. Blei, and B. E. Engelhardt. Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure. Proceedings of the National Academy of Sciences, 112(26), 2015.

A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.

K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

D. R. Musser. Introspective sorting and selection algorithms. Software: Practice and Experience, 27(8):983–993, 1997.

R. M. Neal. Bayesian mixture modeling. In Maximum Entropy and Bayesian Methods, pages 197–211. Springer, 1992.

R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants.
In Learning in Graphical Models, pages 355–368. Springer, 1998.

S.-K. Ng and G. J. McLachlan. Speeding up the EM algorithm for mixture model-based segmentation of magnetic resonance images. Pattern Recognition, 37(8):1573–1589, 2004.

NIST. Rich transcriptions database. http://www.nist.gov/speech/tests/rt/, 2007.

P. Orbanz and Y. W. Teh. Bayesian nonparametric models. In Encyclopedia of Machine Learning, pages 81–89. Springer, 2010.

J. Paisley, C. Wang, and D. Blei. The discrete infinite logistic normal distribution for mixed-membership modeling. In Artificial Intelligence and Statistics, 2011.

G. Parisi. Statistical Field Theory. Addison-Wesley, 1988.

J. Pitman and J. Picard. Combinatorial Stochastic Processes: École d'Été de Probabilités de Saint-Flour XXXII - 2002. Springer, 2006.

J. K. Pritchard, M. Stephens, N. A. Rosenberg, and P. Donnelly. Association mapping in structured populations. The American Journal of Human Genetics, 67(1):170–181, 2000.

J. M. Quintana and M. West. An analysis of international exchange rates using multivariate DLM's. The Statistician, pages 275–281, 1987.

L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE, 77(2):257–286, 1989.

R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.

C. E. Rasmussen. The infinite Gaussian mixture model. In Neural Information Processing Systems, 1999.

C. E. Rasmussen and Z. Ghahramani. Occam's razor. In Neural Information Processing Systems, 2001.

D. Rosenbaum and Y. Weiss. The return of the gating network: Combining generative models and discriminative training in natural image priors. In Neural Information Processing Systems, 2015.

S. L. Scott. Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association, 97(457):337–351, 2002.

J. Sethuraman.
A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

E. B. Sudderth. Graphical Models for Visual Object Recognition and Tracking. PhD thesis, Massachusetts Institute of Technology, 2006.

E. B. Sudderth and M. I. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In Neural Information Processing Systems, 2009.

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

Y. W. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In Neural Information Processing Systems, 2008.

L. Theis and M. D. Hoffman. A trust-region method for stochastic variational inference with applications to streaming data. In International Conference on Machine Learning, 2015.

N. Ueda and Z. Ghahramani. Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks, 15(1):1223–1241, 2002.

N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton. SMEM algorithm for mixture models. Neural Computation, 12(9):2109–2128, 2000.

J. Van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden Markov model. In International Conference on Machine Learning, 2008.

M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

S. G. Walker. Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 36(1):45–54, 2007.

C. Wang and D. Blei. Truncation-free online variational inference for Bayesian nonparametric models. In Neural Information Processing Systems, 2012a.

C. Wang and D. M. Blei. A split-merge MCMC algorithm for the hierarchical Dirichlet process. arXiv preprint arXiv:1201.1657, 2012b.

C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process.
In Artificial Intelligence and Statistics, 2011.

Y. Wang, P. Sabzmeydani, and G. Mori. Semi-latent Dirichlet allocation: A hierarchical model for human action recognition. In Human Motion - Understanding, Modeling, Capture and Animation, pages 240–254. Springer, 2007.

Y. J. Wang and G. Y. Wong. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8–19, 1987.

J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.

L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.

A. Zhang, J. Zhu, and B. Zhang. Max-margin infinite hidden Markov models. In International Conference on Machine Learning, 2014.

C. Zhang, C. H. Ek, X. Gratal, F. T. Pokorny, and H. Kjellstrom. Supervised hierarchical Dirichlet processes with variational inference. In ICCV Workshop on Inference for Probabilistic Graphical Models, 2013.

J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: maximum margin supervised topic models. Journal of Machine Learning Research, 13(1):2237–2278, 2012.

J. Zhu, N. Chen, H. Perkins, and B. Zhang. Gibbs max-margin topic models with fast sampling algorithms. In International Conference on Machine Learning, 2013.

D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In International Conference on Computer Vision, 2011.

D. Zoran and Y. Weiss. Natural images, Gaussian mixtures and dead leaves. In Neural Information Processing Systems, 2012.