BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives


Hard negatives are essential for training effective retrieval models. Hard-negative mining typically ranks documents with cross-encoders or static embedding models using similarity metrics such as cosine distance. In biomedical and scientific domains, however, mining becomes challenging because source documents and candidate hard negatives are difficult to distinguish. Referenced documents, by contrast, naturally share contextual relevance with the source document without being duplicates, making them well suited as hard negatives. In this work, we propose BiCA: Biomedical Dense Retrieval with Citation-Aware Hard Negatives, an approach that mines hard negatives from the citation links of 20,000 PubMed articles to improve a domain-specific small dense retriever. We fine-tune the GTE_small and GTE_Base models on these citation-informed negatives and observe consistent improvements in zero-shot dense retrieval, measured by nDCG@10 on both in-domain and out-of-domain BEIR tasks, and outperform baselines on long-tailed topics in LoTTE measured by Success@5. Our findings highlight the potential of leveraging document link structure to generate highly informative negatives, enabling state-of-the-art performance with minimal fine-tuning and demonstrating a path towards highly data-efficient domain adaptation.


💡 Research Summary

The paper introduces BiCA, a novel approach for biomedical dense retrieval that leverages citation information to generate high‑quality hard negatives. Recognizing that hard negatives—documents that are semantically similar to a query but not relevant—are crucial for training effective dense retrievers, the authors observe that cited documents naturally share contextual relevance with a source article while remaining distinct, making them ideal hard negatives. To exploit this, they construct a two‑hop citation neighborhood for each of 20,000 PubMed abstracts using the NCBI E‑utilities API and the pubmed‑parser library. The first hop consists of papers directly cited by the seed article; the second hop includes papers cited by those first‑hop papers. All abstracts are encoded with PubMedBERT‑base embeddings, and a dense similarity graph is built by computing pairwise cosine similarities among the candidates.
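The graph-construction step can be sketched in a few lines of NumPy. The function name `build_similarity_graph` is illustrative, and random vectors stand in for the PubMedBERT-base abstract embeddings used in the paper:

```python
import numpy as np

def build_similarity_graph(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine-similarity matrix over candidate abstract embeddings."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # L2-normalize each row
    return unit @ unit.T                             # (N, N) cosine similarities

# Stand-in for PubMedBERT-base embeddings of the 2-hop candidate abstracts.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 768))
sim = build_similarity_graph(emb)
```

Normalizing once and taking a single matrix product keeps the all-pairs computation to one BLAS call, which matters when the two-hop neighborhood of a seed article is large.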

Hard negative mining proceeds in four stages. First, a synthetic query is generated from the positive abstract using a Doc2Query T5 model, ensuring realistic search intent. Second, the similarity graph is constructed. Third, multiple stochastic traversals are performed: three start nodes (the most query‑similar 1‑hop papers) are selected, and for each, a five‑step walk is executed. At each step, the top‑five unvisited neighbors are considered, and one is sampled probabilistically based on similarity, promoting diversity while still focusing on semantically close documents. A global visited set prevents duplication across walks. Finally, an additional random negative is added for robustness. This process yields on average 6.5 hard negatives per query, resulting in a curated training set of roughly 150,000 documents.
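The traversal stage described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions (function name, tie-breaking, and the exact weighting are ours, not the paper's): three query-similar start nodes, five-step walks, similarity-weighted sampling over the top-five unvisited neighbors, a global visited set, and one extra random negative:

```python
import numpy as np

def mine_hard_negatives(sim, query_sim, n_starts=3, walk_len=5, top_k=5, seed=0):
    """Similarity-guided stochastic walks over the candidate graph.

    sim       : (N, N) cosine-similarity matrix among candidate abstracts
    query_sim : (N,)   similarity of each candidate to the synthetic query
    """
    rng = np.random.default_rng(seed)
    n = sim.shape[0]
    visited = set()
    # Start from the candidates most similar to the synthetic query.
    starts = np.argsort(query_sim)[::-1][:n_starts]
    for s in starts:
        node = int(s)
        visited.add(node)
        for _ in range(walk_len):
            # Top-k most similar neighbors not yet in the *global* visited set.
            order = np.argsort(sim[node])[::-1]
            cands = [int(j) for j in order if int(j) not in visited][:top_k]
            if not cands:
                break
            w = np.clip(sim[node, cands], 1e-9, None)
            node = int(rng.choice(cands, p=w / w.sum()))  # similarity-weighted step
            visited.add(node)
    # One additional random negative for robustness.
    rest = [i for i in range(n) if i not in visited]
    if rest:
        visited.add(int(rng.choice(rest)))
    return sorted(visited)

# Toy graph of 12 candidates with unit-norm embeddings.
rng0 = np.random.default_rng(1)
emb = rng0.normal(size=(12, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T
qsim = rng0.random(12)
negs = mine_hard_negatives(sim, qsim)
```

Weighting the step distribution by similarity keeps walks near the query's semantic neighborhood, while the stochastic choice and the global visited set inject the diversity the paper describes.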

The authors fine‑tune two GTE models—GTE_small (33 M parameters, 384‑dim embeddings) and GTE_Base (110 M parameters, 768‑dim embeddings)—using a Multiple Negative Ranking (MNR) loss. Training is deliberately lightweight: only 20 optimization steps on a single NVIDIA V100 GPU, demonstrating the data‑efficiency of the approach.
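For each query in a batch, the MNR objective treats the other in-batch positives (and any mined hard negatives) as negatives and applies a cross-entropy over the similarity scores. A minimal NumPy sketch of the in-batch form follows; the actual fine-tuning would use the GTE models' embeddings, typically via a library such as sentence-transformers, and the `scale` value here is an assumed temperature:

```python
import numpy as np

def mnr_loss(q, p, scale=20.0):
    """In-batch Multiple Negatives Ranking loss (NumPy sketch).

    q : (B, D) query embeddings; p : (B, D) positive-document embeddings.
    For query i the positive is p[i]; every other p[j] acts as a negative.
    """
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    pn = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = scale * (qn @ pn.T)                 # (B, B) scaled cosine scores
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))        # cross-entropy on the diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss_matched = mnr_loss(q, q.copy())       # aligned pairs -> low loss
loss_shuffled = mnr_loss(q, q[::-1].copy())  # mismatched pairs -> high loss
```

The loss drives each query toward its own positive and away from every other document in the batch, which is why the quality of the mined hard negatives matters so much.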

Evaluation is conducted in a zero‑shot setting on fourteen BEIR datasets and four LoTTE sub‑tasks. Using nDCG@10 as the primary metric for BEIR, BiCA_Base achieves an average score of 0.684 and BiCA_small 0.661, surpassing strong baselines such as GTR‑Base (0.539), GTR‑Large (0.557), DPR (0.332), and various recent dense retrievers. On LoTTE, which emphasizes long‑tailed topics, both models attain a Success@5 of 0.815, outperforming all baselines. A latency analysis shows that BiCA_small processes queries in approximately 1.2 ms on a V100, confirming its suitability for real‑time applications.
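The two evaluation metrics are straightforward to state precisely. The sketch below uses the standard definitions (function names are ours): nDCG@k divides the discounted cumulative gain of the system ranking by that of the ideal ranking, and Success@k checks whether any relevant document appears in the top k results:

```python
import numpy as np

def ndcg_at_k(ranked_rels, all_rels, k=10):
    """nDCG@k: DCG of the system ranking over DCG of the ideal ranking."""
    def dcg(rels):
        rels = np.asarray(rels, dtype=float)[:k]
        return float(np.sum(rels / np.log2(np.arange(2, rels.size + 2))))
    ideal = dcg(sorted(all_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

def success_at_k(ranked_rels, k=5):
    """Success@k: 1.0 if any relevant document appears in the top k, else 0.0."""
    return float(any(r > 0 for r in ranked_rels[:k]))
```

`ranked_rels` is the list of graded relevance labels in the order the retriever returned them, and `all_rels` is the full set of labels for the query, needed to form the ideal ranking.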

Key contributions include: (1) a citation‑aware hard negative mining pipeline that transforms structural citation links into semantically rich training signals; (2) the release of two domain‑specific dense retrievers (BiCA_Base and BiCA_small) that achieve state‑of‑the‑art performance with minimal fine‑tuning; (3) extensive zero‑shot benchmarking demonstrating strong cross‑domain generalization; and (4) a thorough latency assessment highlighting practical deployment feasibility.

The study also discusses limitations: reliance on citation data may reduce coverage for newly published or sparsely cited papers, and the current pipeline uses only abstracts, omitting full‑text and additional metadata that could further enrich negative sampling. Future work is suggested to incorporate other graph signals (e.g., co‑authorship, keyword co‑occurrence), to explore full‑text embeddings, and to combine the citation‑based negatives with large‑language‑model‑generated synthetic documents for even richer training sets.

In summary, BiCA shows that leveraging the inherent link structure of scientific literature can produce highly informative hard negatives, enabling small, efficient dense retrievers to match or exceed the performance of much larger models. This citation‑driven data augmentation offers a promising, scalable path for domain‑adapted information retrieval across any field where citation or hyperlink networks are available.

