Source Forager: A Search Engine for Similar Source Code

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Developers spend a significant amount of time searching for code: e.g., to understand how to complete, correct, or adapt their own code for a new context. Unfortunately, the state of the art in code search has not evolved much beyond text search over tokenized source. Code has much richer structure and semantics than normal text, and this property can be exploited to specialize the code-search process for better querying, searching, and ranking of code-search results. We present a new code-search engine named Source Forager. Given a query in the form of a C/C++ function, Source Forager searches a pre-populated code database for similar C/C++ functions. Source Forager preprocesses the database to extract a variety of simple code features that capture different aspects of code. A search returns the $k$ functions in the database that are most similar to the query, based on the various extracted code features. We tested the usefulness of Source Forager using a variety of code-search queries from two domains. Our experiments show that the ranked results returned by Source Forager are accurate, and that query-relevant functions can be reliably retrieved even when searching through a large code database that contains very few query-relevant functions. We believe that Source Forager is a first step towards much-needed tools that provide a better code-search experience.


💡 Research Summary

Source Forager is a novel code‑search engine that addresses the shortcomings of traditional text‑based search tools, which treat source code merely as a bag of tokens and ignore its rich structural and semantic information. The system accepts a C/C++ function as a query and returns the k most similar functions from a pre‑populated database. Its core contribution is the introduction of “feature‑classes”: a set of orthogonal code characteristics (e.g., type‑operation coupling, skeleton abstract‑syntax‑tree patterns, decorated skeletons, natural‑language terms from comments, string and numeric literals, library calls, type signatures, local types, etc.). For each function, a feature‑vector is built by extracting a feature‑observation for every class. Most feature‑observations are sets, and similarity within a class is measured by the Jaccard index; a few classes use custom tree‑based similarity functions.
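Since most feature-observations are sets compared with the Jaccard index, the per-class similarity can be sketched in a few lines. This is an illustrative example, not the paper's implementation; the feature names shown are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical observations for one feature-class ("library calls")
query_obs = {"malloc", "memcpy", "free"}
cand_obs = {"malloc", "free", "printf"}
print(jaccard(query_obs, cand_obs))  # 2 shared / 4 total = 0.5
```

The same function is applied independently within each set-valued feature-class; the few tree-structured classes would substitute their custom similarity functions.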

The architecture consists of an offline phase and an online phase. During the offline phase the entire code corpus is parsed, each function’s feature‑vector is generated, and the vectors are stored in Pliny‑DB, an in‑memory object store that provides fast random access and native similarity functions. In the online phase a query function is processed through the same feature extraction pipeline, producing a query vector. A weighted combination of per‑class similarity scores yields a single composite similarity value (weighted average). The system then scans all vectors, maintaining a priority queue of size k to keep the top‑k results. Experiments show that on a database of 500,000 functions, the top‑10 results are retrieved in under 2 seconds on a single 8‑core machine.
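The online scan described above can be sketched as a weighted-average score plus a bounded min-heap. This is a minimal sketch under stated assumptions: the dictionary-based vector layout, the feature-class names, and the uniform handling of all classes via Jaccard are illustrative, not the paper's actual data structures.

```python
import heapq

def jaccard(a: set, b: set) -> float:
    """Set similarity; 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def composite_similarity(query_vec, cand_vec, weights, class_sims):
    """Weighted average of per-feature-class similarity scores."""
    total = sum(weights.values())
    return sum(w * class_sims[c](query_vec[c], cand_vec[c])
               for c, w in weights.items()) / total

def top_k(query_vec, database, k, weights, class_sims):
    """Scan all stored vectors, keeping the k best in a min-heap
    whose root is always the weakest of the current top-k."""
    heap = []  # entries: (score, function_id)
    for fid, vec in database.items():
        score = composite_similarity(query_vec, vec, weights, class_sims)
        if len(heap) < k:
            heapq.heappush(heap, (score, fid))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, fid))
    return sorted(heap, reverse=True)  # best first
```

A usage example with a single hypothetical "library calls" class: a database of three functions queried with k = 2 returns the exact match first, then the partial overlap.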

Weighting of feature‑classes can be configured in two ways. The “dyn‑select” mode automatically selects a subset of feature‑classes that are present in the query, which is useful when no domain knowledge is available. The “svm‑weights” mode pre‑computes class weights using supervised learning: a linear SVM is trained on a labeled dataset from a specific domain (e.g., algorithm implementations, system‑level code) to predict the importance of each class. Empirical results demonstrate that domain‑specific weights significantly improve precision and recall, especially when relevant functions are sparse in the database.
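A plausible reading of the dyn-select mode is sketched below: restrict attention to feature-classes that are actually populated in the query and weight them uniformly. This is an assumption for illustration; the paper's exact selection rule may differ.

```python
def dyn_select_weights(query_vec):
    """Uniform weights over feature-classes with non-empty observations
    in the query (simplified sketch of the dyn-select mode)."""
    present = [c for c, obs in query_vec.items() if obs]
    return {c: 1.0 / len(present) for c in present}
```

The svm-weights mode would instead replace this uniform assignment with per-class weights learned by a linear SVM on domain-labeled data.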

Extensibility is a first‑class design goal. Adding a new feature‑class requires (1) implementing a feature extractor that produces the observation for any function, and (2) providing a similarity function for that observation type. The current implementation uses CodeSonar for C/C++ parsing, but the architecture is abstracted so that any static analysis tool or even a different programming language can be swapped in. Moreover, while the prototype operates on functions, the same pipeline can be retargeted to classes, methods, or modules without fundamental changes.
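The two-part extension contract (an extractor plus a similarity function) can be expressed as a small plugin interface. The class names and the regex-based toy extractor below are hypothetical, meant only to show the shape of the contract, not the CodeSonar-based extraction the prototype actually uses.

```python
import re
from abc import ABC, abstractmethod

class FeatureClass(ABC):
    """Hypothetical plugin interface: a new feature-class supplies
    (1) an extractor producing an observation for any function, and
    (2) a similarity function over that observation type."""

    @abstractmethod
    def extract(self, source: str):
        ...

    @abstractmethod
    def similarity(self, a, b) -> float:
        ...

class CallNames(FeatureClass):
    """Toy feature-class: identifiers that appear before '(' in the source."""
    CALL = re.compile(r"\b(\w+)\s*\(")

    def extract(self, source: str) -> frozenset:
        return frozenset(self.CALL.findall(source))

    def similarity(self, a, b) -> float:
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)  # Jaccard index
```

Registering such a class would require no changes to the rest of the pipeline, which is the point of the design: extraction and comparison are the only class-specific pieces.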

The authors evaluated Source Forager on two real‑world domains with 100+ query functions each, comparing against a baseline text‑search system. Source Forager achieved a top‑5 accuracy of 78% versus 42% for the baseline, and an average query latency of 1.8 seconds, confirming both effectiveness and usability.

Limitations include the relatively coarse granularity of many feature‑classes (they capture sets of tokens or simple AST patterns but not full data‑flow or control‑dependence information) and the current focus on C/C++, which requires new parsers for other languages. Future work outlined by the authors includes richer graph‑based flow analyses, integration of neural embeddings, and a distributed version of Pliny‑DB to enable truly massive, cloud‑scale code search.

In summary, Source Forager demonstrates that a multi‑feature, similarity‑weighted approach can dramatically improve code‑search relevance while maintaining interactive response times, offering a practical foundation for next‑generation developer tools that understand code beyond plain text.

