Neural compilation and decompilation with conditional language models

Authors

Armengol Estapé, Jordi

Abstract

Compilers routinely translate programming languages into lower-level languages, enabling optimization and execution across hardware architectures. However, manually crafting these translators requires decades of engineering effort. Moreover, while compilers excel at lowering code, rule-based systems struggle to translate in the opposite direction, limiting applications in cybersecurity and software migration. This thesis identifies compilation as both an opportunity for automation and a source of training data for that automation, by means of conditional language models that explicitly learn low-level languages in the context of compilers. First, it proposes a new machine learning task, neural compilation, to study models that learn to emulate a compiler. Then, it develops a large-scale dataset of compilable and executable functions. Next, it uses this new dataset to build a state-of-the-art neural decompiler that can produce readable and accurate C from optimized assembly. Finally, it trains a neural lifter to translate legacy assembly into compact compiler representations that can be reoptimized and recompiled to new architectures. This thesis shows that explicitly learning from compilation data with conditional language models is a promising alternative to both traditional rule-based approaches and general-purpose language models with less exposure to low-level languages.
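
As an illustrative sketch (not taken from the thesis or its dataset), the decompilation direction described above can be pictured as learning to map optimized assembly back to readable source. For a trivial addition function, the model's input would be the optimized x86-64 output of a compiler such as gcc at -O2 (shown here in Intel syntax), and its target the C source a decompiler aims to recover:

    ; illustrative input: optimized x86-64 assembly (gcc -O2, Intel syntax)
    add:
        lea     eax, [rdi+rsi]      ; compute a + b in one instruction
        ret                         ; return the sum in eax

    /* illustrative target: readable C the neural decompiler aims to recover */
    int add(int a, int b) { return a + b; }

Neural compilation trains a conditional language model on such pairs in the C-to-assembly direction, while neural decompilation and lifting learn the reverse mapping, from assembly back to C or to a compiler's intermediate representation.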