Bo Dai

Bo Dai

My research interests lie on designing principled machine learning methods. Currently, I mainly focus on three major themes:
  • Reinforcement learning: design effective algorithms by exploiting the intrinsic structures in the uncertain dynamics for automatic decision making.
  • Learning to design algorithms: improve the algorithms, e.g., sampling, searching and planning, by leveraging empirical experiences.
  • Structured input and output: build effective models for capturing the structures information in input and output, e.g., binaries, sequences, programs, trees, and graphs.
More information can be found in Google Scholar and my personal homepage.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action-Spaces
Haitong Ma
Ofir Nabati
Na Li
Shie Mannor
Guy Tennenholtz
Proceedings of the 43rd International Conference on Machine Learning (ICML-26), Seoul, South Korea (2026)
Preview abstract Reinforcement learning (RL) algorithms have achieved superhuman performance on many sequential decision-making tasks, but often struggle in domains with large, combinatorial action spaces. To address this, we introduce a practical and stable algorithm for training discrete diffusion models to represent policies in such environments. We formulate a policy mirror descent algorithm that enhances training stability by reframing policy optimization as an inference problem, which naturally aligns with the learning objective of discrete diffusion models. Through extensive experiments on a suite of challenging benchmark tasks, we demonstrate that our approach achieves significant improvements over existing methods in both performance and sample efficiency. This work opens a promising new direction for applying discrete diffusion models in RL to tackle long-standing challenges in large-scale combinatorial action spaces. View details
Diffusion Controller: Framework, Algorithms and Parameterization
Tong Yang
Moonkyung Ryu
Guy Tennenholtz
Yuejie Chi
Proceedings of the 43rd International Conference on Machine Learning (ICML-26), Seoul, South Korea (2026)
Preview abstract Controllable generation with diffusion models is often treated as a collection of heuristics rather than a unified optimization problem. We propose a principled control formulation by viewing the diffusion reverse process as an instance of a (generalized) linearly-solvable Markov decision process (LS-MDP). This perspective turns controllable generation into regularized optimal control around a pretrained diffusion policy, yielding tractable objectives and algorithmic updates. Under this framework, we study two practical finetuning regimes. When paired target data are available, we obtain a supervised finetuning (SFT) objective. When only a terminal reward model is available, we derive reinforcement-learning finetuning (RLFT) methods from the LS-MDP solution structure, including (i) a reward-weighted regression loss and (ii) a policy-gradient approach (with standard extensions such as PPO). Crucially, the LS-MDP optimality conditions imply an explicit relationship between the optimal and pretrained score functions. We leverage this to derive a new score-function parameterization that isolates the control signal and enables “gray-box” finetuning with substantially fewer trainable parameters. Experiments across SFT and RLFT show this parameterization improves over existing finetuning baselines while achieving stronger sample/parameter efficiency. View details
Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models
Guy Tennenholtz
Izzeddin Gur
Vincent Zhuang
Aviral Kumar
Rishabh Agarwal
Sridhar Thiagarajan
Aleksandra Faust
Proceedings of the 13th International Conference on Learning Representations (ICLR-25), Singapore (2025)
Preview abstract Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large language models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning (RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input—a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%. View details
Preview abstract The alignment of language models (LMs) with human values increasingly relies on using other LMs as automated judges, or ``autoraters''. However, their reliability is limited by a foundational issue: they are trained on deterministic preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a truly reliable autorater must learn to model the full distribution of preference defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: direct supervised fine-tuning for dense, probabilistic labels, and a reinforcement learning approach for sparse, binary labels. Our empirical results show that finetuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks. View details
UQE: A Query Engine for Unstructured Databases
Hanjun Dai
Bethany Wang
Sherry Yang
Pengcheng Yin
Phitchaya Mangpo Phothilimthana
Advances in Neural Information Processing Systems (NeurIPS) (2024)
Preview abstract Analytics on structured data is a mature field with many successful methods. However, most real world data exists in unstructured form, such as images and conversations. We investigate the potential of Large Language Models (LLMs) to enable unstructured data analytics. In particular, we propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections. This engine accepts queries in a Universal Query Language (UQL), a dialect of SQL that provides full natural language flexibility in specifying conditions and operators. The new engine leverages the ability of LLMs to conduct analysis of unstructured data, while also allowing us to exploit advances in sampling and optimization techniques to achieve efficient and accurate query execution. In addition, we borrow techniques from classical compiler theory to better orchestrate the workflow between sampling methods and foundation model calls. We demonstrate the efficiency of UQE on data analytics across different modalities, including images, dialogs and reviews, across a range of useful query types, including conditional aggregation, semantic retrieval and abstraction aggregation. View details
Preview abstract Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt the score-based modeling to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data, and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks. View details
Can Small Heads Help? Understanding and Improving Multi-Task Generalization
Christopher Fifty
Dong Lin
Li Wei
Lichan Hong
Yuyan Wang
Zhe Zhao
the WebConf 2022 (2022)
Preview abstract A goal for multi-task learning from a multi-objective optimization perspective is to find the Pareto solutions that are not dominated by others. In this paper, we provide some insights on understanding the trade-off between Pareto efficiency and generalization, as a result of parameterization in deep learning: as a multi-objective optimization problem, enough parameterization is needed for handling task conflicts in a constrained solution space; however, from a multi-task generalization perspective, over-parameterization undermines the benefit of learning a shared representation which helps harder tasks or tasks with limited training examples. A delicate balance between multi-task generalization and multi-objective optimization is therefore needed for finding a better trade-off between efficiency and generalization. To this end, we propose a method of under-parameterized self-auxiliaries for multi-task models to achieve the best of both worlds. It is model-agnostic, task-agnostic and works with other multi-task learning algorithms. Empirical results show our method improves Pareto efficiency over existing popular algorithms on several multi-task applications. View details
Neural Stochastic Dual Dynamic Programming
Hanjun Dai
Emily Xue
Zia M Syed
ICLR 2022 (2022)
Preview abstract Stochastic dual dynamic programming~(SDDP) is one of the state-of-the-art algorithm for multi-stage stochastic optimization, yet its cost exponentially increases w.r.t. the size of decision variables, therefore, quickly becomes inapplicable for high-dimension problems. We introduce a neuralized component into SDDP, which outputs a \emph{piece-wise linear function} in a \emph{low-dimension} space to approximate the value function, based on the \emph{context of the problem instances}. The neuralized component will consistently evolve to abstract effective low-dimension action space and improve the quality of value function approximation for each problem based on prior successful experiences. It is seamlessly integrated with SDDP, formed our neural enhanced solver,~\AlgName~(\algshort), which achieves the optimality \emph{without loss of accuracy} in \emph{faster speed} for high-dimension and long-horizon multi-stage stochastic optimizations. We conduct thorough empirical experiments to demonstrate the benefits of \algshort from transferability on scalability.~\algshort significantly outperforms the competitors, including SDDP and variants of RL algorithms, in terms of solution quality and feasibility, and computational speed. View details
Preview abstract Retrosynthesis is the process of identifying a set of reactants to synthesize a target molecule. It is critical to material design and drug discovery. Existing machine learning approaches based on language models and graph neural networks have achieved encouraging results. However, the inner connections of these models are rarely discussed, and rigorous evaluations of these models are largely in need. In this paper, we propose a framework that unifies sequence- and graph-based methods as energy-based models (EBMs) with different energy functions. This unified view establishes connections and reveals the differences between models, thereby enhances our understanding of model design. We also provide a comprehensive assessment of performance to the community. Additionally, we present a novel dual variant within the framework that performs consistent training to induce the agreement between forward- and backward-prediction. This model improves the state-of-the-art of template-free methods with or without reaction types. View details
On the Optimality of Batch Policy Optimization Algorithms
Chenjun Xiao
Yifan Wu
Tor Lattimore
Jincheng Mei
Lihong Li
ICML 2021 (2021)
Preview abstract Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain under-developed. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis. For this family, we show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral. Our analysis reveals that instance-dependent optimality, commonly used to establish optimality of on-line stochastic bandit algorithms, cannot be achieved by any algorithm in the batch setting. In particular, for any algorithm that performs optimally in some environment, there exists another environment where the same algorithm suffers arbitrarily larger regret. Therefore, to establish a framework for distinguishing algorithms, we introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction. We demonstrate how this criterion can be used to justify commonly used pessimistic principles for batch policy optimization. View details
×