DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval
Patent prior-art retrieval becomes especially challenging when relevant disclosures cross technological boundaries. Existing benchmarks lack explicit domain partitions, making it difficult to assess how retrieval systems cope with such shifts. We introduce DAPFAM, a family-level benchmark with explicit IN-domain and OUT-domain partitions defined by a new IPC3 overlap scheme. The dataset contains 1,247 query families and 45,336 target families, aggregated at the family level to reduce international redundancy, with citation-based relevance judgments. We conduct 249 controlled experiments spanning lexical (BM25) and dense (transformer) backends, document-level and passage-level retrieval, multiple query and document representations, aggregation strategies, and hybrid fusion via Reciprocal Rank Fusion (RRF). Results reveal a pronounced domain gap: OUT-domain performance remains roughly five times lower than IN-domain across all configurations. Passage-level retrieval consistently outperforms document-level retrieval, and dense methods provide modest gains over BM25, but none closes the OUT-domain gap. Document-level RRF yields strong effectiveness-efficiency trade-offs with minimal overhead. By exposing the persistent challenge of cross-domain retrieval, DAPFAM provides a reproducible, compute-aware testbed for developing more robust patent IR systems. The dataset is publicly available on Hugging Face at https://huggingface.co/datasets/datalyes/DAPFAM_patent.
💡 Research Summary
The paper addresses a critical gap in patent prior‑art retrieval: the difficulty of finding relevant patents when the query and the target belong to different technological domains. Existing benchmarks such as CLEF‑IP, TREC‑Patent Track, MAREC, and BigPatent either focus on a single jurisdiction, lack explicit domain partitions, or do not aggregate at the family level, which prevents systematic cross‑domain evaluation. To fill this gap, the authors introduce DAPFAM (Domain‑Aware Family‑level Patent Retrieval Benchmark), a dataset that explicitly separates IN‑domain and OUT‑domain retrieval scenarios using an IPC‑3 overlap scheme, where IPC‑3 denotes the three‑character class level of the International Patent Classification (for example, H04).
DAPFAM consists of 1,247 query patent families and 45,336 target families, all aggregated at the family level to reduce redundancy across international filings. Relevance judgments are derived from citation links, mirroring the signals used by patent examiners. A query–target pair is labeled IN‑domain if the two families share at least one IPC‑3 code; otherwise the pair is OUT‑domain. This binary partition enables a direct measurement of the “domain gap” that aggregate metrics would otherwise hide.
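The IPC‑3 overlap rule can be sketched as a set intersection over truncated classification codes. This is a minimal illustration, not the authors' preprocessing code: the helper names `ipc3` and `domain_label` are hypothetical, and it assumes that each family carries a list of full IPC symbols whose first three characters give the IPC‑3 class.

```python
def ipc3(code: str) -> str:
    """Truncate a full IPC symbol to its class level, e.g. 'H04L 9/32' -> 'H04'."""
    return code.strip().upper()[:3]


def domain_label(query_codes, target_codes) -> str:
    """Label a query-target pair by IPC-3 overlap (assumed rule: any shared class)."""
    query_classes = {ipc3(c) for c in query_codes}
    target_classes = {ipc3(c) for c in target_codes}
    # A non-empty intersection means the two families share a technology class.
    return "IN" if query_classes & target_classes else "OUT"


query = ["H04L 9/32", "G06F 21/60"]
print(domain_label(query, ["G06F 17/30"]))  # shares class G06
print(domain_label(query, ["A61K 31/00"]))  # no shared class
```

Under this sketch the same query family can have both IN‑domain and OUT‑domain relevant targets, which is what allows the two partitions to be scored separately.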
The experimental protocol is exhaustive: 249 controlled runs explore (i) lexical retrieval with BM25, (ii) a single multilingual transformer encoder for dense retrieval, (iii) document‑level versus passage‑level indexing, (iv) four passage aggregation strategies (maxP, avgP, sumP, avg_top3), (v) multiple query field combinations (title, abstract, claims, description), and (vi) hybrid fusion via Reciprocal Rank Fusion (RRF). Evaluation uses NDCG@100 and Recall@100, balancing ranking quality and coverage.
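The four passage aggregation strategies can be illustrated with a short sketch that collapses passage-level hits into one score per family. The function name `aggregate` and the input shape (a list of `(doc_id, score)` pairs) are assumptions, and `avg_top3` is interpreted here as the mean of the three highest passage scores; the paper's exact definitions may differ.

```python
from collections import defaultdict
from statistics import mean


def aggregate(passage_hits, strategy="maxP"):
    """Collapse (doc_id, passage_score) hits into a ranked list of (doc_id, score)."""
    scores_by_doc = defaultdict(list)
    for doc_id, score in passage_hits:
        scores_by_doc[doc_id].append(score)

    reducers = {
        "maxP": max,    # best single passage
        "avgP": mean,   # mean over retrieved passages
        "sumP": sum,    # total evidence across passages
        "avg_top3": lambda s: mean(sorted(s, reverse=True)[:3]),  # assumed meaning
    }
    reduce_scores = reducers[strategy]
    doc_scores = {doc: reduce_scores(s) for doc, s in scores_by_doc.items()}
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)


hits = [("A", 0.9), ("A", 0.1), ("B", 0.6), ("B", 0.5), ("B", 0.5), ("B", 0.4)]
print(aggregate(hits, "maxP")[0][0])  # family with the strongest single passage
print(aggregate(hits, "sumP")[0][0])  # family with the most accumulated evidence
```

The toy input shows why the choice matters: a family with one very strong passage and a family with several moderately strong passages can swap places depending on the reducer.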
Key findings:
- Domain Gap – Across all configurations, OUT‑domain performance is roughly five times lower than IN‑domain (both NDCG and Recall). This demonstrates that current IR models, even dense neural ones, struggle to bridge the semantic and terminological distance between disparate technology areas.
- Passage‑Level Advantage – Splitting patents into fixed‑size passages and aggregating scores consistently outperforms whole‑document retrieval. The maxP strategy (taking the highest passage score) yields the best results, confirming that the most relevant snippet often drives the overall relevance judgment.
- Dense vs Lexical – The transformer‑based dense encoder provides modest gains (3‑7% relative improvement) over BM25, but does not close the domain gap. Dense representations capture some semantic similarity but remain vulnerable to domain‑specific vocabularies.
- Hybrid Fusion – RRF applied at the document level offers a favorable effectiveness‑efficiency trade‑off: it improves NDCG by 4‑6% with minimal additional indexing or latency cost, making it attractive for production‑grade patent search systems.
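Reciprocal Rank Fusion, used for the hybrid runs above, combines ranked lists using only rank positions: each document receives the sum of 1 / (k + rank) over the lists in which it appears. The sketch below is a generic implementation, not the authors' code; the function name `rrf` is hypothetical, and the constant k = 60 is the conventional default from the RRF literature, assumed here because the summary does not state the value used.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids; returns doc ids ordered by fused score."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in any list receive a larger contribution.
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


bm25_run = ["A", "B", "C"]    # toy lexical ranking
dense_run = ["B", "D", "A"]   # toy dense ranking
print(rrf([bm25_run, dense_run]))  # ['B', 'A', 'D', 'C']
```

Because the fusion needs only the two ranked lists and no score calibration or extra index, it is consistent with the minimal overhead reported for document-level RRF.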
All data, preprocessing scripts, indexing pipelines, and evaluation code are released on HuggingFace (https://huggingface.co/datasets/datalyes/DAPFAM_patent), ensuring reproducibility and low entry barriers for future research. The authors argue that DAPFAM will become a standard testbed for developing domain‑robust patent IR methods, encouraging work on domain adaptation, meta‑learning, and the integration of structured metadata (IPC codes, citation graphs) to mitigate the observed performance drop in cross‑domain scenarios.
In summary, DAPFAM provides the first family‑level, cross‑domain benchmark for patent retrieval, quantifies the persistent five‑fold performance penalty when moving out of domain, validates the superiority of passage‑level indexing, shows limited but consistent benefits from dense embeddings, and demonstrates that simple rank‑fusion can yield practical improvements. The dataset and experimental framework lay the groundwork for the next generation of robust, domain‑agnostic patent search technologies.