Structured Generative Models of Natural Source Code
We study the problem of building generative models of natural source code (NSC); that is, source code written and understood by humans. Our primary contribution is to describe a family of generative models for NSC that have three key properties: First, they incorporate both sequential and hierarchical structure. Second, we learn a distributed representation of source code elements. Finally, they integrate closely with a compiler, which allows leveraging compiler logic and abstractions when building structure into the model. We also develop an extension that includes more complex structure, refining how the model generates identifier tokens based on what variables are currently in scope. Our models can be learned efficiently, and we show empirically that including appropriate structure greatly improves the models, measured by the probability of generating test programs.
💡 Research Summary
The paper tackles the challenge of building a generative model for natural source code (NSC), i.e., code written and read by humans, by explicitly incorporating the hierarchical and sequential structure inherent in programming languages, learning distributed representations of code elements, and tightly integrating compiler information. The authors argue that flat token‑level models (e.g., n‑grams) fail to capture essential properties such as nesting, variable scoping, and type constraints, and therefore propose a family of models built on abstract syntax trees (ASTs) generated by a probabilistic push‑down automaton they call Log‑bilinear Tree‑Traversal (LTT).
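The models described above operate on abstract syntax trees rather than flat token streams. The paper extracts C# ASTs via a compiler; as a stand-in illustration, Python's own `ast` module can show the kind of tree structure and depth-first traversal order involved (the source snippet and `walk` helper are purely illustrative):

```python
import ast

# The paper works on C# ASTs extracted by a compiler; Python's ast
# module stands in here to show the hierarchical structure involved.
source = "for i in range(10):\n    total = total + i\n"
tree = ast.parse(source)

def walk(node, depth=0):
    """Print node types in the depth-first order an AST-based model generates them."""
    print("  " * depth + type(node).__name__)
    for child in ast.iter_child_nodes(node):
        walk(child, depth + 1)

walk(tree)
```

The nesting visible in the printout (a `For` node containing its body statements) is exactly the hierarchical structure a flat n-gram model cannot represent.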
The core idea is to generate an AST in depth‑first order using a stack. At each step an internal node nᵢ is popped, a traversal state vector hᵢ (which summarises the partial tree, scope, and other contextual information) is updated, and a child‑tuple Cᵢ is sampled. The distribution p(Cᵢ | nᵢ, hᵢ) is parameterised with a log‑bilinear energy function: an inner product between a representation of the child tuple R_ch(Cᵢ) and a context representation R_con(nᵢ, hᵢ) plus a bias term. The context representation is a linear combination of the embedding of the parent node and the embeddings of the components of hᵢ, weighted by diagonal matrices that allow position‑dependent modulation. This formulation keeps the number of parameters linear in the dimensionality of hᵢ, enabling high‑dimensional contextual vectors without combinatorial explosion.
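The log-bilinear distribution over child tuples can be sketched in a few lines of NumPy. The sizes, representations, and scores below are made-up illustrations, not values from the paper; the point is the structure: a context vector formed as a diagonally-weighted sum of embeddings, an inner product with each candidate child-tuple representation plus a bias, and a softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_tuples, n_context = 16, 5, 3            # illustrative sizes, not from the paper

R_ch = rng.normal(size=(n_tuples, D))        # child-tuple representations R_ch(C)
biases = rng.normal(size=n_tuples)           # per-tuple bias terms
W = rng.normal(size=(n_context, D))          # diagonal weight matrices, stored as vectors
embeddings = rng.normal(size=(n_context, D)) # parent-node + traversal-variable embeddings

def child_tuple_distribution(embeddings, W, R_ch, biases):
    """p(C | n, h) ∝ exp(R_ch(C) · R_con(n, h) + b_C), where R_con is a
    position-weighted combination of the context embeddings."""
    # Multiplying by a diagonal matrix is an elementwise product, so the
    # parameter count stays linear in D rather than quadratic.
    R_con = (W * embeddings).sum(axis=0)
    scores = R_ch @ R_con + biases
    scores -= scores.max()                   # numerical stability
    p = np.exp(scores)
    return p / p.sum()

p = child_tuple_distribution(embeddings, W, R_ch, biases)
```

The elementwise product in `R_con` is what keeps the parameter count linear in the dimensionality of hᵢ: each context slot contributes D diagonal weights rather than a D×D matrix.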
Two classes of traversal variables are distinguished. Deterministic variables can be computed directly from the already generated partial tree (e.g., the set of variables currently in scope, the types of ancestor nodes, recent tokens). Because they are deterministic functions of the generation history, the transition distribution p(hᵢ | hᵢ₋₁) can be expanded to condition on all previously generated traversal states and nodes, allowing the model to use arbitrary previously generated structure while preserving tractable inference. Latent variables, by contrast, are learned as hidden states and are integrated via standard gradient-based learning.
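A deterministic traversal variable can be maintained incrementally as generation proceeds. The sketch below (class and method names are illustrative, not from the paper) tracks the in-scope variable set with a stack of block frames, mirroring how a compiler's scope rules nest:

```python
# Sketch of a deterministic traversal variable: the set of in-scope
# variables, maintained as generation enters and leaves blocks.
class ScopeTracker:
    def __init__(self):
        self.stack = [set()]              # one frame per enclosing block

    def enter_block(self):
        self.stack.append(set())          # a new block opens a fresh frame

    def exit_block(self):
        self.stack.pop()                  # leaving the block drops its declarations

    def declare(self, name):
        self.stack[-1].add(name)          # declaration lands in the innermost frame

    def in_scope(self):
        return set().union(*self.stack)   # union over all enclosing frames

scope = ScopeTracker()
scope.declare("args")
scope.enter_block()
scope.declare("i")
inner = scope.in_scope()                  # both "args" and "i" visible here
scope.exit_block()
outer = scope.in_scope()                  # "i" no longer in scope
```

Because every update is determined by the partial tree, this variable adds context to the model without introducing any new latent randomness.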
A key extension is the explicit modeling of variable scoping. By feeding the list of in‑scope variables (including their types and names) into hᵢ, the model learns to generate identifier tokens that respect declarations, avoid out‑of‑scope references, and follow common naming conventions (e.g., outer loop variable “i”, inner loop variable “j”). This dramatically improves the realism of generated code compared with a plain PCFG that samples identifiers independently.
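One simple way to picture the scoping extension is as a masked softmax: identifiers not currently in scope receive zero probability, and the log-bilinear scores only compete among legal choices. The vocabulary and scores below are invented for illustration:

```python
import numpy as np

# Hedged sketch: restrict the identifier distribution to in-scope variables
# by masking out-of-scope entries before normalising. Values are made up.
vocab = ["i", "j", "total", "tmp", "result"]
scores = np.array([2.0, 1.0, 0.5, 0.0, -1.0])   # illustrative log-bilinear scores
in_scope = {"i", "total"}

mask = np.array([v in in_scope for v in vocab])
masked = np.where(mask, scores, -np.inf)        # out-of-scope identifiers get -inf
p = np.exp(masked - masked[mask].max())         # stable softmax over legal choices
p = p / p.sum()
```

Here `p` assigns zero probability to `j`, `tmp`, and `result`, so a sample can never reference a variable that was not declared — the property a plain PCFG sampling identifiers independently cannot guarantee.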
Training maximises the log-likelihood of observed ASTs. Because the log-bilinear parameterisation is smooth, the log-likelihood is differentiable in the model parameters and stochastic gradient ascent can be applied on mini-batches of trees. The authors use the Roslyn C# compiler to extract ASTs and scope information from a large corpus of open-source C# projects (≈10 M lines of code). They compare LTT against several baselines: n-gram language models, standard PCFGs, and a log-bilinear tree model that lacks traversal variables. Evaluation metrics include held-out log-likelihood, the percentage of sampled programs that compile, and human judgments of code naturalness.
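The gradient computation reduces to softmax cross-entropy at each generation step. The toy training loop below (synthetic data, invented sizes; plain per-example gradient descent rather than anything the paper specifies) shows the shape of the optimisation: each observed (context, child-tuple) pair contributes a softmax gradient to the tuple representations and biases:

```python
import numpy as np

rng = np.random.default_rng(1)
D, n_tuples = 8, 4                          # illustrative sizes
R_ch = rng.normal(0, 0.1, size=(n_tuples, D))
b = np.zeros(n_tuples)

def nll_and_grads(R_con, target, R_ch, b):
    """Negative log-likelihood of one observed child tuple, plus gradients."""
    scores = R_ch @ R_con + b
    scores -= scores.max()
    p = np.exp(scores); p /= p.sum()
    nll = -np.log(p[target])
    g = p.copy(); g[target] -= 1.0          # softmax cross-entropy gradient
    return nll, np.outer(g, R_con), g       # grads for R_ch and b

# Synthetic "corpus": contexts whose true child tuple is linearly decodable.
W_true = rng.normal(size=(n_tuples, D))
data = [(x, int(np.argmax(W_true @ x))) for x in rng.normal(size=(200, D))]

lr = 0.1
first = sum(nll_and_grads(x, y, R_ch, b)[0] for x, y in data)
for _ in range(50):                         # gradient descent on the NLL
    for x, y in data:
        _, gR, gb = nll_and_grads(x, y, R_ch, b)
        R_ch -= lr * gR
        b -= lr * gb
last = sum(nll_and_grads(x, y, R_ch, b)[0] for x, y in data)
```

Because every step of the generative process is a categorical distribution with smooth scores, the full-tree log-likelihood decomposes into a sum of such per-step terms.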
Results show that LTT achieves substantially higher log‑likelihoods and generates compilable code at a far higher rate than baselines. Sampled for‑loops from LTT exhibit sensible variable naming and correct scoping, whereas PCFG samples often contain nonsensical identifiers or out‑of‑scope uses. The paper demonstrates that integrating compiler‑level abstractions (AST structure, scoping rules) into a probabilistic model yields tangible benefits for code generation tasks.
In conclusion, the authors present a principled, efficient framework for learning generative models of source code that respect both syntactic hierarchy and semantic context. They argue that such models can serve as a unified prior for a variety of software‑engineering applications, including autocomplete, bug detection, API recommendation, and program synthesis. Future work is suggested on extending the approach to richer language features (generics, lambdas), handling multi‑file projects, and exploring deeper latent representations for code semantics.