Improving Information Retrieval through Contextual Ranking with Large Language Models

Description

Abstract:
Large pre-trained transformer-based language models (PTLMs) have recently dominated the state of the art in Information Retrieval tasks such as web search and question answering. Despite the advantages such models offer in utilizing term context to build better query and document representations, their training currently relies on point-wise or pair-wise similarity learning and overlooks the effect of ranking context, that is, of jointly scoring a large set of documents closely related to the query. The latter setting better reproduces the target objective of comparing a query against a large collection of documents, and was found beneficial in the pre-PTLM Learning-to-Rank literature.

In the present work, we first explicitly investigate the effect of ranking context and its constituent parts: (1) jointly scoring a large number of candidates, (2) using retrieved (query-specific) instead of random negatives, and (3) a fully list-wise loss. To this end, we introduce COntextual Document Embedding Reranking (CODER), a highly efficient and generic fine-tuning framework that for the first time enables incorporating ranking context into the transformer-based language models used in state-of-the-art dense retrieval. CODER acts as a lightweight, performance-enhancing framework that can be applied to virtually any existing dual-encoder model.

We next explore the potential CODER offers for directly optimizing retrieval for essentially context-dependent properties, such as ranking fairness. We find that, compared to existing alternatives for deep neural retrieval architectures, our end-to-end differentiable and efficient approach based on CODER can attain much stronger bias mitigation (fairness), while for the same amount of bias mitigation it offers significantly better relevance performance (utility). Crucially, our method allows for a more finely controllable and predictable intensity of bias mitigation.

Lastly, we seek to enhance the ranking context itself by addressing the problem of sparse relevance annotation in modern large-scale retrieval datasets. To avoid penalizing the model for false negatives during training, we propose evidence-based label smoothing: propagating relevance from the ground-truth documents to unlabeled documents that are closely related to them. To that end, we leverage the concept of reciprocal neighbors, moving beyond geometric similarity and exploiting local connectivity in the shared representation space. We find that using the CODER framework to fine-tune retrievers on the recomputed labels substantially improves ranking effectiveness.
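To make the list-wise setting described above concrete, here is a minimal sketch of such a loss, assuming a PyTorch dual-encoder; the function and variable names are illustrative, not the thesis implementation. It jointly scores all retrieved candidates for one query and normalizes over the entire list rather than over isolated pairs:

import torch
import torch.nn.functional as F

def listwise_loss(query_emb, cand_embs, labels):
    # query_emb: (d,) query embedding from the query encoder
    # cand_embs: (N, d) embeddings of N jointly scored candidates, i.e.,
    #            retrieved (query-specific) negatives plus ground-truth positives
    # labels:    (N,) target relevance distribution summing to 1
    scores = cand_embs @ query_emb            # inner-product relevance scores, shape (N,)
    log_probs = F.log_softmax(scores, dim=0)  # normalize over the whole candidate list
    return -(labels * log_probs).sum()        # cross-entropy against the label distribution

# Example: 1,000 candidates with a single labeled positive at index 0
# labels = torch.zeros(1000); labels[0] = 1.0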
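The abstract does not detail the fairness objective. One common differentiable formulation, given here purely as an assumed example, penalizes the deviation of a protected group's expected exposure (under the same list-wise softmax) from a target value; added to the relevance loss with a weight, that weight controls the intensity of bias mitigation:

def fairness_penalty(scores, group_mask, target_exposure=0.5):
    # scores:     (N,) joint relevance scores for the candidate list
    # group_mask: (N,) 1.0 for candidates in the protected group, else 0.0
    probs = F.softmax(scores, dim=0)          # exposure each candidate receives
    exposure = (probs * group_mask).sum()     # expected exposure of the protected group
    return (exposure - target_exposure) ** 2  # differentiable deviation from the target

# total = listwise_loss(q, cands, labels) + lam * fairness_penalty(scores, mask)
# where lam (a hypothetical hyperparameter) trades utility against bias mitigation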
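For evidence-based label smoothing, the sketch below shows one assumed NumPy implementation of reciprocal neighbors, i.e., mutual k-nearest-neighbor relationships in the shared embedding space, and of spreading relevance from labeled positives to their reciprocal neighbors; k and alpha are illustrative hyperparameters, not values from the thesis:

import numpy as np

def reciprocal_neighbors(doc_embs, k):
    # doc_embs: (N, d) L2-normalized document embeddings
    # Returns a boolean (N, N) matrix R with R[i, j] True iff j is among
    # i's k nearest neighbors AND i is among j's (mutual connectivity)
    sims = doc_embs @ doc_embs.T
    np.fill_diagonal(sims, -np.inf)               # exclude self-matches
    topk = np.argsort(-sims, axis=1)[:, :k]       # each row's k nearest neighbors
    knn = np.zeros(sims.shape, dtype=bool)
    rows = np.repeat(np.arange(sims.shape[0]), k)
    knn[rows, topk.ravel()] = True
    return knn & knn.T                            # keep only reciprocal links

def smooth_labels(labels, reciprocal, alpha=0.2):
    # Spread a fraction alpha of each positive's relevance to its reciprocal
    # neighbors, then renormalize to a valid target distribution
    spread = alpha * (reciprocal.astype(float) @ labels)
    smoothed = (1.0 - alpha) * labels + spread
    return smoothed / smoothed.sum()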
Notes:
Thesis (Ph. D.)--Brown University, 2024

Citation

Zerveas, George, "Improving Information Retrieval through Contextual Ranking with Large Language Models" (2024). Computer Science Theses and Dissertations. Brown Digital Repository. Brown University Library. https://repository.library.brown.edu/studio/item/bdr:caa8g3xc/

Relations

Collection: Computer Science Theses and Dissertations