Finding Pearls in the Literature Ocean: Cis-Lexicon Ontology Search Engine (CLOSE) for Gene Regulatory Networks of the Regulatory Genome


Cis-regulatory modules (CRMs) are the “computational units” of the gene regulatory network: they are the sites where transcription factors bind and activate the production of further transcription factors or other proteins. CRMs can be hard to find. Machine learning techniques have been used to predict their locations in the genome, and databases cataloguing these predicted CRMs are available. However, these predictions have achieved only about 50% accuracy when checked against actual function, which makes the data almost useless for accurately characterizing CRMs. For CRMs that have been verified by the experimental techniques specified by the Davidson criteria, by contrast, we can be certain of their function in the genome. Biology papers are freeform, and the same term can be “promiscuous,” taking on different uses depending on context, which makes teaching a computer to recognize whether a given paper uses a specific experimental technique to identify a CRM an incredibly difficult task. General-purpose machine learning techniques such as naive Bayes, support vector machines, and basic neural networks all failed to identify a satisfactory number of the known cis-regulatory papers and produced far too many false positives to be worth an annotator’s time. We have developed the “Lattice-CLOSE” algorithm to accurately separate unrelated papers from papers that use the Davidson criteria to identify the function of CRMs. It produces an accurate and easily interpretable classifier for identifying these papers. We used it to identify 870 papers that employ these experimental techniques. By altering the “concept list” parameters used to train the algorithm, Lattice-CLOSE can be applied to other biomedical literature extraction tasks that require filtering papers for specific techniques or concepts.

