Improving complex reasoning in large language models
Abstract
This thesis studies complex reasoning in language models. We use the term reasoning to refer to tasks that would require a human to think slowly, deliberately, and step by step (rather than respond intuitively and instantaneously), such as mathematical and scientific reasoning, commonsense reasoning, logical reasoning, and strategic reasoning. We use reasoning capability to refer collectively to the ability to solve tasks that require complex sub-problem decomposition and detailed step-by-step analysis.
Our motivation for studying reasoning in language models stems from their intriguing theoretical properties (e.g., how scaling laws relate to emergent abilities) and their vast application potential. From an application perspective, we envisage large language models (LLMs) becoming the next generation of computational platforms, much like operating systems, and aim to build a new application ecosystem on top of them. This vision requires the underlying base model to reason over a wide range of complex real-world scenarios. From a modeling perspective, complex reasoning is viewed as a typical ability that emerges with scale: provided other conditions are in place (e.g., clean data and a stable training process), the more compute one spends, the more likely the model is to have stronger reasoning capability.
We start by reviewing the learning paradigms of large language models, and then discuss fundamental methods for improving reasoning at multiple stages of the model development pipeline. Modern language model development typically consists of four stages: pretraining, instruction finetuning, reinforcement learning from human feedback, and in-context learning after model deployment. This thesis discusses improving reasoning through in-context learning, finetuning, and learning from feedback. For in-context learning, we propose complexity-based prompting and demonstrate that the model's scientific and logical reasoning performance consistently improves as the complexity of the in-context demonstrations increases. This work achieved state-of-the-art performance on the GSM8K [Cobbe et al., 2021] and MATH [Hendrycks et al.] datasets at the time it was proposed and has influenced follow-on work by highlighting the importance of data complexity. For instruction tuning, we devise a detailed recipe for specializing smaller language models on mathematical reasoning tasks. We highlight the importance of chain-of-thought formatted data, the use of a finetuned checkpoint, and the balance among capabilities in different directions. This work significantly improved small models' performance on GSM8K and other math benchmarks at the time it was proposed and has consistently influenced follow-on work by highlighting the importance of capability balancing. For learning from AI feedback, we show that it is possible to construct a self-improving agent on strategic reasoning tasks by letting agents play against and criticize each other, and that the ability to self-improve is strongly correlated with the base model and with how well it is aligned with human instructions. Finally, we review the current state-of-the-art models, highlighting the benchmark saturation problem and the importance of constructing new, challenging datasets. We further discuss future directions in multimodal scaling and iterative learning from human, environment, and AI feedback.
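As a rough illustration of the complexity-based prompting idea summarized above, the following minimal sketch picks the in-context demonstrations with the longest reasoning chains and assembles them into a prompt. It rests on assumptions not stated in this abstract: complexity is approximated here by the number of non-empty lines in a demonstration's chain of thought, and the names `Demo`, `complexity`, `select_complex_demos`, and `build_prompt`, as well as the prompt layout, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One in-context demonstration (hypothetical container)."""
    question: str
    chain_of_thought: str  # assumption: one reasoning step per line
    answer: str


def complexity(demo: Demo) -> int:
    """Assumed proxy for complexity: count of non-empty reasoning lines."""
    return sum(1 for line in demo.chain_of_thought.splitlines() if line.strip())


def select_complex_demos(pool: list[Demo], k: int) -> list[Demo]:
    """Return the k demonstrations with the most reasoning steps."""
    return sorted(pool, key=complexity, reverse=True)[:k]


def build_prompt(demos: list[Demo], question: str) -> str:
    """Concatenate the demonstrations, then append the new question."""
    parts = [
        f"Question: {d.question}\n{d.chain_of_thought}\nAnswer: {d.answer}"
        for d in demos
    ]
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


# Toy pool with made-up arithmetic questions, for illustration only.
pool = [
    Demo("What is 2 + 3?", "2 + 3 = 5.", "5"),
    Demo(
        "What is (2 + 3) * 4 - 1?",
        "First, 2 + 3 = 5.\nThen, 5 * 4 = 20.\nFinally, 20 - 1 = 19.",
        "19",
    ),
    Demo("What is 2 * 3 + 1?", "First, 2 * 3 = 6.\nThen, 6 + 1 = 7.", "7"),
]

chosen = select_complex_demos(pool, k=2)
prompt = build_prompt(chosen, "What is (1 + 2) * 3?")
```

With this toy pool, the selection keeps the three-step and two-step demonstrations and drops the one-step demonstration. Other complexity measures (for example, token count) could be swapped into `complexity` without changing the rest of the sketch.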