KAIST AIPR Lab Artificial Intelligence & Probabilistic Reasoning Lab

Research

연구실 지원에 관하여 관심이 있으신 학생들은 지도교수 이메일로 문의바랍니다.

We are seeking highly motivated student researchers (MS/PhD program students or post docs). For those who are interested, email the professor here.

Home

Featured Recent Publications

Yunseon Choi, Junyoung Jang, Chaeyoung Oh, Minchan Jeong, Doohwan Hwang, Kee-Eung Kim: Group-Normalized Implicit Value Optimization for Language Models. International Conference on Learning Representations (ICLR). 2026. [📄 Abstract] [✏️ Paper] [📋 Poster]

Fine-tuning Large Language Models (LLMs) with reinforcement learning (RL) has become a key technique for enhancing performance on a wide range of tasks, from user alignment to complex reasoning. However, this approach is often hindered by the difficulty of fine-grained credit assignment, as it typically relies on sparse rewards given only at the end of a completely generated sequence. Conventional solutions often require training an auxiliary value network known as critic, which introduces significant computational overhead and training instability. We present Group-Normalized Implicit Value Optimization (GN-IVO), a novel, critic-free algorithm that directly addresses this challenge. GN-IVO learns step-level values implicitly from the policy through a group-normalized distributional matching objective. This approach elegantly circumvents the need for an explicit critic and avoids the computation of the intractable partition function by normalizing values across a group of sampled model responses. Theoretically, we prove that our objective recovers the true value function up to a constant, guaranteeing that the optimal policy is preserved. We demonstrate the practical effectiveness of GN-IVO on a diverse set of text generation and reasoning tasks, showing that it consistently outperforms strong RL baselines for LLMs.

Joo Bon Maeng*, Seongmin Lee*, Seokin Seo, Kee-Eung Kim: Goal-Conditioned DPO: Prioritizing Safety in Misaligned Instructions. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL). 2025. [📄 Abstract] [✏️ Paper]

Large language models (LLMs) undergo extensive safety training to maximize both helpfulness and harmlessness in their responses. However, various jailbreak attacks jeopardize model safety, allowing malicious actors to bypass safety guidelines. Existing defense methods primarily focus on aligning the model's output towards less harmful responses through post-processing or input perturbation. Consequently, these approaches are prone to general performance degradation and lack the ability to defend against a wide variety of attacks. In this paper, we propose goal-conditioned direct preference optimization (GC-DPO), which is trained to prioritize the system prompt over the user prompt through goal-conditioning, and thus enables a good balance between safety and performance. Empirically, we show that our approach significantly reduces the average Attack Success Rate (ASR) on a wide variety of jailbreak attacks. In particular, GC-DPO achieves a reduction of 67.1% to 5.0% in ASR for Vicuna-7B, a state-of-the-art result, without compromising the model's general performance.

Jungwoo Park*, Young Jin Ahn*, Kee-Eung Kim, Jaewoo Kang: Monet: Mixture of Monosemantic Experts for Transformers. International Conference on Learning Representations (ICLR). 2025. [📄 Abstract] [✏️ Paper]

Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity—where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post-hoc reconstruction loss. To address this issue, we introduce Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the number of experts. Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, Monet allows knowledge manipulation over domains, languages, and toxicity mitigation without degrading general performance. Our pursuit of transparent LLMs highlights the potential of scaling expert counts to enhance mechanistic interpretability and directly resect the internal knowledge to fundamentally adjust model behavior.

Yunseon Choi, Minchan Jeong, Soobin Um, Kee-Eung Kim: DPAIL: Training Diffusion Policy for Adversarial Imitation Learning without Policy Optimization. Advances in Neural Information Processing Systems (NeurIPS). 2025. [📄 Abstract] [✏️ Paper]

Human experts employ diverse strategies to complete a task, producing to multi-modal demonstration data. Although traditional Adversarial Imitation Learning(AIL) methods have achieved notable success, they often collapse theses multimodal behaviors into a single strategy, failing to replicate expert behaviors. To overcome this limitation, we propose DPAIL, an adversarial IL framework that leverages diffusion models as a policy class to enhance expressiveness. Building on the Adversarial Soft Advantage Fitting (ASAF) framework, which removes the need for policy optimization steps, DPAIL trains a diffusion policy using a binary cross-entropy objective to distinguish expert trajectories from generated ones. To enable optimization of the diffusion policy, we introduce a novel, tractable lower bound on the policy’s likelihood. Through comprehensive quantitative and qualitative evaluations against various baselines, we demonstrate that our method not only captures diverse behaviors but also remains robust as the number of behavior modes increases.

Updates

Homepage Update - October 30, 2023