KAIST AIPR Lab (Artificial Intelligence & Probabilistic Reasoning Lab)


Students interested in applying to the lab are encouraged to contact the advisor by email.

We are seeking highly motivated student researchers (MS/PhD students or postdocs). If you are interested, please contact the professor by email.


Featured Recent Publications

Joo Bon Maeng*, Seongmin Lee*, Seokin Seo, Kee-Eung Kim: Goal-Conditioned DPO: Prioritizing Safety in Misaligned Instructions. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL). 2025. [📄 Abstract]

Large language models (LLMs) undergo extensive safety training to maximize both helpfulness and harmlessness in their responses. However, various jailbreak attacks jeopardize model safety, allowing malicious actors to bypass safety guidelines. Existing defense methods primarily focus on aligning the model's output towards less harmful responses through post-processing or input perturbation. Consequently, these approaches are prone to general performance degradation and lack the ability to defend against a wide variety of attacks. In this paper, we propose goal-conditioned direct preference optimization (GC-DPO), which is trained to prioritize the system prompt over the user prompt through goal-conditioning, and thus enables a good balance between safety and performance. Empirically, we show that our approach significantly reduces the average Attack Success Rate (ASR) on a wide variety of jailbreak attacks. In particular, GC-DPO reduces the ASR on Vicuna-7B from 67.1% to 5.0%, a state-of-the-art result, without compromising the model's general performance.
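
The abstract does not spell out the training objective, but the core idea is a DPO-style preference loss in which the safety goal (system prompt) is conditioned into the context. A minimal sketch of such a loss is given below; the function name, argument layout, and the assumption that goal-conditioning enters only through how the log-probabilities are computed are illustrative, not the paper's exact formulation.

```python
import torch.nn.functional as F

def gc_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of a DPO preference loss. In a goal-conditioned setup, all
    log-probabilities are assumed to be computed on inputs that prepend the
    safety goal (system prompt), so the preference for the safe completion
    is learned conditional on that goal."""
    # Implicit reward margins of the trained policy relative to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Standard DPO objective: prefer the chosen (safe) completion over the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```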

Jungwoo Park*, Young Jin Ahn*, Kee-Eung Kim, Jaewoo Kang: Monet: Mixture of Monosemantic Experts for Transformers. International Conference on Learning Representations (ICLR). 2025. [📄 Abstract] [✏️ Paper]

Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity, where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post-hoc reconstruction loss. To address this issue, we introduce the Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the number of experts. Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, Monet allows knowledge manipulation over domains, languages, and toxicity mitigation without degrading general performance. Our pursuit of transparent LLMs highlights the potential of scaling expert counts to enhance mechanistic interpretability and directly resect the internal knowledge to fundamentally adjust model behavior.
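
The abstract highlights an expert decomposition that reaches 262,144 experts per layer while parameters grow only with the square root of the expert count. The paper's actual decomposition is not reproduced here; the toy module below only illustrates the square-root scaling idea by composing each of n*n virtual experts from one of n shared "down" projections and one of n shared "up" projections. The class name and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SqrtExpertLayer(nn.Module):
    """Toy illustration: N = n*n virtual experts are formed by pairing n 'down'
    projections with n 'up' projections, so parameter count grows with n = sqrt(N)."""
    def __init__(self, d_model=256, d_hidden=16, n=512):
        super().__init__()
        self.down = nn.Parameter(torch.randn(n, d_model, d_hidden) * 0.02)
        self.up = nn.Parameter(torch.randn(n, d_hidden, d_model) * 0.02)

    def forward(self, x, i, j):
        # Virtual expert (i, j) applies down_i then up_j; with n = 512 this yields
        # 512 * 512 = 262,144 experts from only 2 * 512 weight blocks.
        h = torch.relu(x @ self.down[i])
        return h @ self.up[j]
```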

Seonghyun Ban*, Heesan Kong*, Kee-Eung Kim: Data Augmentation with Diffusion for Open-Set Semi-Supervised Learning. Advances in Neural Information Processing Systems (NeurIPS). 2024. [📄 Abstract] [✏️ Paper]

Semi-supervised learning (SSL) seeks to utilize unlabeled data to overcome the limited amount of labeled data and improve model performance. However, many SSL methods struggle in real-world scenarios, particularly when there is a large number of irrelevant instances in the unlabeled data that do not belong to any class in the labeled data. Previous approaches often downweight instances from irrelevant classes to mitigate the negative impact of class distribution mismatch on model training. However, by discarding irrelevant instances, these approaches may lose valuable information such as invariance, regularity, and diversity within the data. In this paper, we propose a data-centric generative augmentation approach that leverages a diffusion model to enrich labeled data using both labeled and unlabeled samples. A key challenge is extracting the diversity inherent in the unlabeled data while mitigating the generation of samples irrelevant to the labeled data. To tackle this issue, we combine diffusion model training with a discriminator that identifies and reduces the impact of irrelevant instances. We also demonstrate that such a trained diffusion model can even convert an irrelevant instance into a relevant one, yielding highly effective synthetic data for training. Through a comprehensive suite of experiments, we show that our data augmentation approach significantly enhances the performance of SSL methods, especially in the presence of class distribution mismatch.
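
The key mechanism described above is a discriminator that reduces the impact of unlabeled instances irrelevant to the labeled classes during diffusion-model training. The sketch below shows one way such re-weighting could enter a denoising loss; the noising step, the denoiser and discriminator interfaces, and the exact weighting scheme are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def weighted_denoising_loss(denoiser, discriminator, x_unlabeled, t, noise):
    """Hypothetical sketch: down-weight the denoising loss on unlabeled samples
    that a discriminator judges irrelevant to the labeled classes."""
    x_noisy = x_unlabeled + noise  # stand-in for the actual forward process q(x_t | x_0)
    pred = denoiser(x_noisy, t)    # assumed denoiser interface
    with torch.no_grad():
        # Discriminator output interpreted as a relevance weight in [0, 1].
        w = torch.sigmoid(discriminator(x_unlabeled)).squeeze(-1)
    per_sample = F.mse_loss(pred, noise, reduction="none").flatten(1).mean(dim=1)
    return (w * per_sample).mean()
```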

Seokin Seo, Byung-Jun Lee, Jongmin Lee, HyeongJoo Hwang, Hongseok Yang, Kee-Eung Kim: Mitigating Covariate Shift in Behavioral Cloning via Robust Stationary Distribution Correction. Advances in Neural Information Processing Systems (NeurIPS). 2024. [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

We consider offline imitation learning (IL), which aims to train an agent to imitate expert behavior from a dataset of expert demonstrations without online interaction with the environment. Behavioral Cloning (BC) has been a simple yet effective approach to offline IL, but it is also well known to be vulnerable to covariate shift resulting from the mismatch between the state distributions induced by the learned policy and the expert policy. Moreover, as often occurs in practice, when expert datasets are collected from an arbitrary state distribution instead of a stationary one, these shifts become more pronounced, potentially leading to substantial failures in existing IL methods. Specifically, we focus on covariate shift resulting from arbitrary state data distributions, such as biased data collection or incomplete trajectories, rather than shifts induced by changes in dynamics or noisy expert actions. In this paper, to mitigate the effect of the covariate shifts in BC, we propose DrilDICE, which utilizes a distributionally robust BC objective by employing a stationary distribution correction ratio estimation (DICE) to derive a feasible solution. We evaluate the effectiveness of our method through an extensive set of experiments covering diverse covariate shift scenarios. The results demonstrate the efficacy of the proposed approach in improving the robustness against the shifts, outperforming existing offline IL methods in such scenarios.
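
The abstract describes re-weighting the BC objective with stationary distribution correction ratios estimated via a DICE-style procedure. That estimation step, and the distributionally robust formulation, are not reproduced here; the snippet below only sketches how such per-sample ratios could re-weight a standard BC loss. The policy interface and variable names are assumptions.

```python
def weighted_bc_loss(policy, states, expert_actions, correction_ratios):
    """Hypothetical sketch: behavioral cloning re-weighted by stationary
    distribution correction ratios (assumed to come from a separate DICE-style
    estimation step, not shown), so over- or under-represented states in the
    dataset are re-balanced toward the expert's stationary distribution."""
    log_probs = policy.log_prob(states, expert_actions)  # assumed policy interface
    # Ratios are treated as fixed weights when optimizing the policy.
    return -(correction_ratios.detach() * log_probs).mean()
```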

Updates