Reward Design with Program Graphs for Reinforcement Learning Guided Training of Large Language Models for Program Synthesis
Abstract
In recent years, Large Language Models (LLMs) have increasingly been used to solve the problem of automatic code generation, also known as program synthesis. Frequently termed Code LLMs, these models continue to make headlines as new model architectures, training techniques, and benchmark datasets are explored by academic institutions and commercial entities around the world. For all their strengths and impact, most Code LLMs share one similarity that is also a limitation: they are trained with standard supervised learning objectives such as next token prediction (NTP). In effect, the models are trained on code as if it were natural language text, ignoring properties unique to code, such as syntax and semantics, that could serve as a rich signal to the LLM. To address this limitation, training frameworks that incorporate alternative techniques such as reinforcement learning (RL) have been proposed. However, introducing RL to this setup brings an additional challenge: designing a reward function for the Code LLM.
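For reference, the next token prediction objective referred to above is the standard autoregressive cross-entropy loss; for a sequence of code tokens $x = (x_1, \dots, x_T)$ and model parameters $\theta$, it is commonly written as

\[
\mathcal{L}_{\mathrm{NTP}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right),
\]

which treats the code purely as a token sequence and carries no explicit notion of its syntax or semantics.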
This thesis looks at reward design within the context of RL for Code LLMs. We hypothesize that a better reward signal for Code LLMs undergoing RL-based training will yield a better model, as measured by performance on downstream code generation tasks. To test this hypothesis, we design a model, itself a Code LLM, that performs a deep semantic analysis of code in order to assign scores to programs. These scores serve as a measure of code quality, specifically the syntactic and semantic correctness of the generated code in a given context. Consistent with the terminology introduced in an earlier work, we call this model a \emph{discriminator}, as it is able to distinguish between human-written and machine-generated code. First, we design and build a discriminator and analyze its performance on a standard benchmark dataset. In contrast to existing versions of the discriminator, our proposed framework incorporates signal from the code text as well as from the corresponding code graphs, including data flow graphs (DFGs) and control flow graphs (CFGs). We find that our proposed model significantly outperforms existing baselines on the task of distinguishing between human-written and machine-generated programs. Next, we deploy this enhanced discriminator within the context of RL-based training of Code LLMs. We then compare Code LLMs trained with the standard NTP objective against Code LLMs trained with RL, using both the existing discriminator and our novel graph-based discriminator as reward models. Through these experiments, we explore the role of the reward function in shaping the RL training of Code LLMs and the potential of deploying RL-based techniques within the space of LLMs for code.
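As a rough illustration of the kind of signal a graph-aware discriminator can draw on, the sketch below extracts toy def-to-use edges (a crude stand-in for a DFG) from a Python snippet using the standard ast module, and blends a discriminator score with that graph signal into a scalar reward. The function names, the heuristic, and the weighting are illustrative assumptions only, not the implementation used in the thesis.

```python
import ast
from collections import defaultdict

def extract_dataflow_edges(source: str):
    """Collect toy def->use edges as (variable, def_line, use_line) tuples.

    A crude approximation of a data flow graph, not the DFG/CFG
    construction used in the thesis.
    """
    tree = ast.parse(source)
    defs, uses = defaultdict(list), defaultdict(list)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            (defs if isinstance(node.ctx, ast.Store) else uses)[node.id].append(node.lineno)
    edges = []
    for name, use_lines in uses.items():
        for use_line in use_lines:
            prior_defs = [d for d in defs.get(name, []) if d <= use_line]
            if prior_defs:  # link each use to the closest preceding definition
                edges.append((name, max(prior_defs), use_line))
    return edges

def graph_aware_reward(source: str, discriminator_score: float, alpha: float = 0.5) -> float:
    """Hypothetical reward: blend a discriminator score in [0, 1] with a
    minimal graph-based signal (does the snippet parse, and does it contain
    at least one def->use edge?). Purely illustrative."""
    try:
        edges = extract_dataflow_edges(source)
    except SyntaxError:
        return 0.0  # unparseable generations receive no reward
    graph_signal = 1.0 if edges else 0.5
    return alpha * discriminator_score + (1.0 - alpha) * graph_signal

snippet = "x = 1\ny = x + 2\nprint(y)"
print(extract_dataflow_edges(snippet))                        # e.g. [('x', 1, 2), ('y', 2, 3)]
print(graph_aware_reward(snippet, discriminator_score=0.8))   # 0.9 with the default weighting
```

In the thesis itself, the discriminator is a learned Code LLM rather than a hand-written heuristic, and the resulting score is consumed as the reward signal by the RL training procedure; the sketch only shows the general shape of such a reward computation.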