This dissertation investigates learning dependency grammars for statistical natural language parsing from corpora without parse tree annotations. Most successful work in unsupervised dependency grammar induction has assumed that the input consists of sequences of parts of speech, ignoring words and using extremely simple probabilistic models. Supervised parsing, however, has long shown the value of more sophisticated models that use lexical features. These models require probability distributions with complex conditioning information, which must be smoothed to avoid sparsity issues.

In this work we explore several dependency grammars that use smoothing and lexical features. We investigate a variety of smoothing regimens and find that smoothing helps even unlexicalized models such as the Dependency Model with Valence. Furthermore, adding lexical features yields the highest-accuracy dependency induction on the Penn Treebank WSJ10 corpus to date. In sum, this dissertation extends unsupervised grammar induction by incorporating lexical conditioning information and by investigating smoothing in an unsupervised framework.
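The sparsity problem the abstract describes can be illustrated with a minimal sketch: a lexicalized estimate P(child | head) is interpolated with a coarser backoff distribution P(child), so head–child pairs never seen in training still receive nonzero probability. The toy counts, the function name, and the interpolation weight below are hypothetical illustrations, not the dissertation's actual model.

```python
from collections import Counter

# Toy head -> child dependency events (hypothetical data for illustration).
events = [("saw", "dog"), ("saw", "cat"), ("saw", "dog"), ("ran", "dog")]

pair_counts = Counter(events)                 # count(head, child)
head_counts = Counter(h for h, _ in events)   # count(head)
child_counts = Counter(c for _, c in events)  # count(child)
total = sum(child_counts.values())

def p_child_given_head(child, head, lam=0.7):
    """Interpolate the sparse lexicalized estimate P(child | head)
    with the backoff distribution P(child), so that unseen
    (head, child) pairs still get nonzero probability."""
    p_lex = pair_counts[(head, child)] / head_counts[head] if head_counts[head] else 0.0
    p_back = child_counts[child] / total
    return lam * p_lex + (1 - lam) * p_back
```

Here the unseen pair ("ran", "cat") gets probability 0.3 · P("cat") = 0.075 rather than zero, which is the basic effect smoothing provides for the richer lexicalized conditioning the dissertation studies.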
Headden, William P. "Unsupervised Bayesian Lexicalized Dependency Grammar Induction" (2012).
Computer Science Theses and Dissertations.
Brown Digital Repository. Brown University Library.
https://doi.org/10.7301/Z0N29V7J