My research interests lie on designing principled machine learning methods. Currently, I mainly focus on three major themes:
Reinforcement learning: design effective algorithms by exploiting the intrinsic structures in the uncertain dynamics for automatic decision making.
Learning to design algorithms: improve the algorithms, e.g., sampling, searching and planning, by leveraging empirical experiences.
Structured input and output: build effective models for capturing the structures information in input and output, e.g., binaries, sequences, programs, trees, and graphs.
Proceedings of the 43rd International Conference on Machine Learning (ICML-26), Seoul, South Korea (2026)
Preview abstract
Reinforcement learning (RL) algorithms have achieved superhuman performance
on many sequential decision-making tasks, but often struggle in domains with
large, combinatorial action spaces. To address this, we introduce a practical and
stable algorithm for training discrete diffusion models to represent policies in
such environments. We formulate a policy mirror descent algorithm that enhances
training stability by reframing policy optimization as an inference problem, which
naturally aligns with the learning objective of discrete diffusion models. Through
extensive experiments on a suite of challenging benchmark tasks, we demonstrate
that our approach achieves significant improvements over existing methods in both
performance and sample efficiency. This work opens a promising new direction
for applying discrete diffusion models in RL to tackle long-standing challenges in
large-scale combinatorial action spaces.View details
Proceedings of the 43rd International Conference on Machine Learning (ICML-26), Seoul, South Korea (2026)
Preview abstract
Controllable generation with diffusion models is often treated as a collection of heuristics rather than a unified optimization problem. We propose a principled control formulation by viewing the diffusion reverse process as an instance of a (generalized) linearly-solvable Markov decision process (LS-MDP). This perspective turns controllable generation into regularized optimal control around a pretrained diffusion policy, yielding tractable objectives and algorithmic updates.
Under this framework, we study two practical finetuning regimes. When paired target data are available, we obtain a supervised finetuning (SFT) objective. When only a terminal reward model is available, we derive reinforcement-learning finetuning (RLFT) methods from the LS-MDP solution structure, including (i) a reward-weighted regression loss and (ii) a policy-gradient approach (with standard extensions such as PPO).
Crucially, the LS-MDP optimality conditions imply an explicit relationship between the optimal and pretrained score functions. We leverage this to derive a new score-function parameterization that isolates the control signal and enables “gray-box” finetuning with substantially fewer trainable parameters. Experiments across SFT and RLFT show this parameterization improves over existing finetuning baselines while achieving stronger sample/parameter efficiency.View details
Proceedings of the 13th International Conference on Learning Representations (ICLR-25), Singapore (2025)
Preview abstract
Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large language models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning (RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input—a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.View details
Preview abstract
The alignment of language models (LMs) with human values increasingly relies on using other LMs as automated judges, or ``autoraters''. However, their reliability is limited by a foundational issue: they are trained on deterministic preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a truly reliable autorater must learn to model the full distribution of preference defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: direct supervised fine-tuning for dense, probabilistic labels, and a reinforcement learning approach for sparse, binary labels. Our empirical results show that finetuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.View details
Advances in Neural Information Processing Systems (NeurIPS) (2024)
Preview abstract
Analytics on structured data is a mature field with many successful methods. However, most real world data exists in unstructured form, such as images and conversations. We investigate the potential of Large Language Models (LLMs) to enable unstructured data analytics. In particular, we propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections. This engine accepts queries in a Universal Query Language (UQL), a dialect of SQL that provides full natural language flexibility in specifying conditions and operators. The new engine leverages the ability of LLMs to conduct analysis of unstructured data, while also allowing us to exploit advances in sampling and optimization techniques to achieve efficient and accurate query execution. In addition, we borrow techniques from classical compiler theory to better orchestrate the workflow between sampling methods and foundation model calls. We demonstrate the efficiency of UQE on data analytics across different modalities, including images, dialogs and reviews, across a range of useful query types, including conditional aggregation, semantic retrieval and abstraction aggregation.View details
Preview abstract
Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt the score-based modeling to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data, and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.View details
Preview abstract
A goal for multi-task learning from a multi-objective optimization perspective is to find the Pareto solutions that are not dominated by others. In this paper, we provide some insights on understanding the trade-off between Pareto efficiency and generalization, as a result of parameterization in deep learning: as a multi-objective optimization problem, enough parameterization is needed for handling task conflicts in a constrained solution space; however, from a multi-task generalization perspective, over-parameterization undermines the benefit of learning a shared representation which helps harder tasks or tasks with limited training examples. A delicate balance between multi-task generalization and multi-objective optimization is therefore needed for finding a better trade-off between efficiency and generalization. To this end, we propose a method of under-parameterized self-auxiliaries for multi-task models to achieve the best of both worlds. It is model-agnostic, task-agnostic and works with other multi-task learning algorithms. Empirical results show our method improves Pareto efficiency over existing popular algorithms on several multi-task applications.View details
Preview abstract
Stochastic dual dynamic programming~(SDDP) is one of the state-of-the-art algorithm for multi-stage stochastic optimization, yet its cost exponentially increases w.r.t. the size of decision variables, therefore, quickly becomes inapplicable for high-dimension problems. We introduce a neuralized component into SDDP, which outputs a \emph{piece-wise linear function} in a \emph{low-dimension} space to approximate the value function, based on the \emph{context of the problem instances}. The neuralized component will consistently evolve to abstract effective low-dimension action space and improve the quality of value function approximation for each problem based on prior successful experiences. It is seamlessly integrated with SDDP, formed our neural enhanced solver,~\AlgName~(\algshort), which achieves the optimality \emph{without loss of accuracy} in \emph{faster speed} for high-dimension and long-horizon multi-stage stochastic optimizations. We conduct thorough empirical experiments to demonstrate the benefits of \algshort from transferability on scalability.~\algshort significantly outperforms the competitors, including SDDP and variants of RL algorithms, in terms of solution quality and feasibility, and computational speed.View details
Preview abstract
Retrosynthesis is the process of identifying a set of reactants to synthesize a target molecule. It is critical to material design and drug discovery. Existing machine learning approaches based on language models and graph neural networks have achieved encouraging results. However, the inner connections of these models are rarely discussed, and rigorous evaluations of these models are largely in need. In this paper, we propose a framework that unifies sequence- and graph-based methods as energy-based models (EBMs) with different energy functions. This unified view establishes connections and reveals the differences between models, thereby enhances our understanding of model design. We also provide a comprehensive assessment of performance to the community. Additionally, we present a novel dual variant within the framework that performs consistent training to induce the agreement between forward- and backward-prediction. This model improves the state-of-the-art of template-free methods with or without reaction types.View details
Preview abstract
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain under-developed. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis. For this family, we show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral. Our analysis reveals that instance-dependent optimality, commonly used to establish optimality of on-line stochastic bandit algorithms, cannot be achieved by any algorithm in the batch setting. In particular, for any algorithm that performs optimally in some environment, there exists another environment where the same algorithm suffers arbitrarily larger regret. Therefore, to establish a framework for distinguishing algorithms, we introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction. We demonstrate how this criterion can be used to justify commonly used pessimistic principles for batch policy optimization.View details