Neural compilation and decompilation with conditional language models
Authors
Armengol Estapé, Jordi
Abstract
Compilers routinely translate programming languages into lower-level languages, enabling optimization and execution across hardware architectures. However, manually crafting these translators requires decades of engineering effort. Moreover, while compilers excel at lowering code, rule-based systems struggle to translate in the opposite direction, limiting applications in cybersecurity and software migration.
This thesis identifies compilation as both an opportunity for automation and a source of data for that automation, by means of conditional language models that explicitly learn low-level languages in the context of compilers. First, it proposes a new machine
learning task, neural compilation, to study models that learn to emulate a compiler.
Then, it develops a large-scale dataset of compilable and executable functions. Next,
it uses this new dataset to build a state-of-the-art neural decompiler that can produce
readable and accurate C from optimized assembly. Finally, it trains a neural lifter to
translate legacy assembly into compact compiler representations that can be reoptimized
and recompiled to new architectures.
This thesis shows that explicitly learning from compilation data with conditional language models is a promising alternative both to traditional rule-based approaches and to general-purpose language models, which have less exposure to low-level languages.