KAIST Artificial Intelligence & Probabilistic Reasoning Lab (AIPR Lab)

Publications

2024

Seongmin Lee*, Jaewook Shin*, Youngjin Ahn, Seokin Seo, Ohjoon Kwon, Kee-Eung Kim: Zero-Shot Multi-Hop Question Answering via Monte-Carlo Tree Search with Large Language Models. arXiv. 2024. [📄 Abstract] [✏️ Paper]

Recent advances in large language models (LLMs) have significantly impacted the domain of multi-hop question answering (MHQA), where systems are required to aggregate information and infer answers from disparate pieces of text. However, the autoregressive nature of LLMs inherently poses a challenge as errors may accumulate if mistakes are made in the intermediate reasoning steps. This paper introduces Monte-Carlo tree search for Zero-shot multi-hop Question Answering (MZQA), a framework based on Monte-Carlo tree search (MCTS) to identify optimal reasoning paths in MHQA tasks, mitigating the error propagation from sequential reasoning processes. Unlike previous works, we propose a zero-shot prompting method, which relies solely on instructions without the support of hand-crafted few-shot examples that typically require domain expertise. We also introduce a behavioral cloning approach (MZQA-BC) trained on self-generated MCTS inference trajectories, achieving an over 10-fold increase in reasoning speed with minimal compromise in performance. The efficacy of our method is validated on standard benchmarks such as HotpotQA, 2WikiMultihopQA, and MuSiQue, demonstrating that it outperforms existing frameworks.
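
For readers who want the mechanics, below is a minimal sketch of how a generic MCTS loop can be wrapped around LLM-proposed reasoning steps. It illustrates the idea only and is not the paper's implementation; `llm_propose_steps` and `llm_score_rollout` are hypothetical stubs standing in for zero-shot LLM calls.

```python
import math
import random

def llm_propose_steps(question, path):
    # Placeholder for a zero-shot, instruction-only LLM call that proposes
    # candidate next reasoning steps given the partial path. Hypothetical.
    return [f"step-{len(path)}-{i}" for i in range(3)] if len(path) < 3 else []

def llm_score_rollout(question, path):
    # Placeholder for an LLM-evaluated rollout of the completed reasoning
    # path (e.g., a verifier or self-consistency score). Hypothetical.
    return random.random()

class Node:
    def __init__(self, state, parent=None):
        self.state = state          # partial chain of reasoning steps
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def uct(node, c=1.4):
    # Upper Confidence bound applied to Trees: exploitation + exploration.
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts_step_selection(question, n_iters=50):
    root = Node(state=[])
    for _ in range(n_iters):
        node = root
        # 1) Selection: descend via UCT while the node is fully expanded.
        while node.children:
            unvisited = [ch for ch in node.children if ch.visits == 0]
            if unvisited:
                node = random.choice(unvisited)
                break
            node = max(node.children, key=uct)
        else:
            # 2) Expansion: ask the LLM for candidate next reasoning steps.
            for step in llm_propose_steps(question, node.state):
                node.children.append(Node(node.state + [step], parent=node))
            if node.children:
                node = random.choice(node.children)
        # 3) Rollout: score the reasoning path reached.
        reward = llm_score_rollout(question, node.state)
        # 4) Backpropagation: update statistics along the selected path.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Commit to the most-visited first step rather than a single greedy
    # autoregressive choice, which is what limits error propagation.
    return max(root.children, key=lambda ch: ch.visits).state
```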

Seonghyun Ban*, Heesan Kong*, Kee-Eung Kim: Data Augmentation with Diffusion for Open-Set Semi-Supervised Learning. Advances in Neural Information Processing Systems (NeurIPS) (to appear). 2024. [📄 Abstract]

(TBU)

Seokin Seo, Byung-Jun Lee, Jongmin Lee, HyeongJoo Hwang, Hongseok Yang, Kee-Eung Kim: Mitigating Covariate Shift in Behavioral Cloning via Robust Stationary Distribution Correction. Advances in Neural Information Processing Systems (NeurIPS) (to appear). 2024. [📄 Abstract]

(TBU)

Oh Joon Kwon, Daiki E. Matsunaga, Kee-Eung Kim: GDPO: Learning to Align Language Models with Diversity Using GFlowNets. Empirical Methods in Natural Language Processing (EMNLP) (to appear). 2024. [📄 Abstract] [✏️ Paper]

A critical component of the current generation of language models is preference alignment, which aims to precisely control the model's behavior to meet human needs and values. The most notable among such methods are Reinforcement Learning with Human Feedback (RLHF) and its offline variant Direct Preference Optimization (DPO), both of which seek to maximize a reward model based on human preferences. In particular, DPO derives reward signals directly from the offline preference data, but in doing so overfits the reward signals and generates suboptimal responses that may reflect human biases in the dataset. In this work, we propose a practical application of a diversity-seeking RL algorithm called GFlowNet-DPO (GDPO) in an offline preference alignment setting to address these challenges. Empirical results show GDPO can generate far more diverse responses than the baseline methods while remaining relatively well aligned with human values in dialog generation and summarization tasks.
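
For context, GFlowNet-style methods train the generator to sample outputs with probability proportional to their reward rather than to maximize it. A standard trajectory-balance objective from the GFlowNet literature is sketched below; this is our illustration, not necessarily the exact GDPO objective:

$$ \mathcal{L}_{\mathrm{TB}}(\tau) \;=\; \left( \log \frac{Z_{\theta} \prod_{t=1}^{T} P_{F}(s_{t} \mid s_{t-1}; \theta)}{R(x) \prod_{t=1}^{T} P_{B}(s_{t-1} \mid s_{t})} \right)^{2} $$

Sampling in proportion to the reward $R(x)$, instead of concentrating on its maximizer, is what preserves response diversity while keeping generations aligned.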

Young Jin Ahn*, Jungwoo Park*, Sangha Park, Jonghyun Choi, Kee-Eung Kim: SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization. Interspeech. 2024. [📄 Abstract] [✏️ Paper]

Visual Speech Recognition (VSR) stands at the intersection of computer vision and speech recognition, aiming to interpret spoken content from visual cues. A prominent challenge in VSR is the presence of homophenes—visually similar lip gestures that represent different phonemes. Prior approaches have sought to distinguish fine-grained visemes by aligning visual and auditory semantics, but often fell short of full synchronization. To address this, we present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision. By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner. SyncVSR shows versatility across tasks, languages, and modalities at the cost of a forward pass. Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.
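
As a rough sketch of the training signal described above (the additive form and the weight $\lambda$ are our assumptions, not taken from the paper), the encoder can receive a per-frame cross-entropy loss against quantized audio tokens on top of the usual VSR objective:

$$ \mathcal{L} \;=\; \mathcal{L}_{\mathrm{VSR}} \;+\; \lambda \sum_{t=1}^{T} \mathrm{CE}\big(\hat{q}_{t},\, q_{t}\big) $$

where $q_{t}$ is the quantized audio token at video frame $t$ and $\hat{q}_{t}$ is the encoder's non-autoregressive prediction, so that every frame receives crossmodal supervision.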

Yunseon Choi, Sangmin Bae, Seonghyun Ban, Minchan Jeong, Chuheng Zhang, Lei Song, Li Zhao, Jiang Bian, Kee-Eung Kim: Hard Prompts Made Interpretable: Sparse Entropy Regularization for Prompt Tuning with RL. Association for Computational Linguistics (ACL). 2024. Oral presentation [📄 Abstract] [✏️ Paper]

With the advent of foundation models, prompt tuning has positioned itself as an important technique for directing model behaviors and eliciting desired responses. Prompt tuning involves selecting appropriate keywords to include in the input, thereby adapting to the downstream task without adjusting or fine-tuning the model parameters. There is a wide range of work in prompt tuning, from approaches that directly harness the backpropagated gradient signals from the model, to those employing black-box optimization such as reinforcement learning (RL) methods. Our primary focus is on RLPrompt, which aims to find optimal prompt tokens leveraging soft Q-learning. While the results show promise, we have observed that the prompts frequently appear unnatural, which impedes their interpretability. We address this limitation by using sparse Tsallis entropy regularization, a principled approach to filtering out unlikely tokens from consideration. We extensively evaluate our approach across various tasks, including few-shot text classification, unsupervised text style transfer, and textual inversion from images. The results indicate a notable improvement over baselines, highlighting the efficacy of our approach in addressing the challenges of prompt tuning. Moreover, we show that the prompts discovered using our method are more natural and interpretable compared to those from other baselines (Deng et al., 2022).
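
For reference, the order-2 ("sparse") Tsallis entropy of a token distribution $\pi$ is, in one common convention (scalings vary across papers),

$$ S_{2}(\pi) \;=\; \tfrac{1}{2}\Big(1 - \sum_{a} \pi(a)^{2}\Big). $$

Regularizing soft Q-learning with it yields sparsemax-like policies that assign exactly zero probability to sufficiently unlikely tokens, which is the mechanism that filters out unnatural prompt candidates.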

Yunseon Choi, Li Zhao, Chuheng Zhang, Lei Song, Jiang Bian, Kee-Eung Kim: Diversification of Adaptive Policy for Effective Offline Reinforcement Learning. International Joint Conference on Artificial Intelligence (IJCAI). 2024. [📄 Abstract] [✏️ Paper]

Offline Reinforcement Learning (RL) aims to learn policies from pre-collected datasets that capture only a subset of the environment's dynamics. The predominant approach has been to solve a constrained optimization formulation, which ensures that the policy visits state-action pairs within the support of the offline dataset. However, this approach has limited the ability to make decisions when the agent faces unknown parts of the environment at deployment time. To address the challenge of decision-making in out-of-support regions, model-based Bayes-adaptive approaches have been proposed by considering all dynamics models that could potentially be the true environment. Since it is generally infeasible to compute the posterior of all dynamics models based on the offline dataset, these approaches usually approximate the posterior by using a finite ensemble of highly probable dynamics models. Hence, the diversity of these models is the key to obtaining good policies. In this work, we propose MoDAP (Model-based Diverse Adaptive Policy Learning), an algorithm to enable the adaptive policy to make informed decisions in previously unexplored states. MoDAP adopts an iterative strategy that simultaneously trains the policy and dynamics models. The policy optimization seeks to maximize expected returns across dynamics models, while the dynamics models are trained to promote policy diversification through the proposed information-theoretic objective. We evaluate MoDAP through experiments on the D4RL and NeoRL benchmarks, demonstrating its superior performance over state-of-the-art algorithms.

Yunseon Choi, 황두환, Kee-Eung Kim: Offline Imitation Learning from Diverse Demonstration Data Using Diffusion Models as Few-Shot Learners. Korea Computer Congress (KCC). 2024. [📄 Abstract] [✏️ Paper]

Reinforcement learning is widely used to find optimal behavior policies through interaction with the environment. In real-world settings, however, real-time interaction with the environment is limited, and it is difficult to continuously provide reward signals for the agent's actions. This work aims to build a model that can effectively imitate expert behavior when interaction with the environment is restricted and only a small amount of expert demonstration data is available. To this end, we employ a diffusion model, a type of generative model, training it to generate all behaviors in the batch data while applying targeted guidance during the sampling process so that expert behavior is imitated effectively.

Haanvid Lee, Tri Wahyu Guntara, Jongmin Lee, Yung-Kyun Noh, Kee-Eung Kim: Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies. International Conference on Learning Representations (ICLR). 2024. Spotlight [📄 Abstract] [✏️ Paper]

We consider off-policy evaluation (OPE) of deterministic target policies for reinforcement learning (RL) in environments with continuous action spaces. While it is common to use importance sampling for OPE, it suffers from high variance when the behavior policy deviates significantly from the target policy. In order to address this issue, some recent works on OPE proposed in-sample learning with importance resampling. Yet, these approaches are not applicable to deterministic target policies for continuous action spaces. To address this limitation, we propose to relax the deterministic target policy using a kernel and learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function, where the action value function is used for policy evaluation. We derive the bias and variance of the estimation error due to this relaxation and provide analytic solutions for the optimal kernel metric. In empirical studies using various test domains, we show that the OPE with in-sample learning using the kernel with optimized metric achieves significantly improved accuracy compared to other baselines.
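
Schematically (in our notation, not the paper's exact formulation), the deterministic policy $\pi$ is relaxed into a kernel density centered at its action, and the kernel metric is then chosen to trade the bias introduced by the relaxation against the variance it removes:

$$ \pi_{\Lambda}(a \mid s) \;=\; K_{\Lambda}\big(a - \pi(s)\big), \qquad \Lambda^{*} \;=\; \arg\min_{\Lambda}\, \big[ \mathrm{Bias}^{2}(\Lambda) + \mathrm{Var}(\Lambda) \big], $$

where $K_{\Lambda}$ is, e.g., a Gaussian kernel with bandwidth matrix $\Lambda$ and the objective is the mean squared error of the estimated TD update vector.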

Kyungsik Lee, Hana Yoo, Sumin Shin, Wooyoung Kim, Yeonung Baek, Hyunjin Kang, Jaehyun Kim, Kee-Eung Kim: A Submodular Optimization Approach to Accountable Loan Approval. IAAI Technical Track on Deployed Highly Innovative Applications of AI (IAAI). 2024. Innovative application award [📄 Abstract] [✏️ Paper]

In the field of finance, the underwriting process is an essential step in evaluating every loan application. During this stage, the borrowers' creditworthiness and ability to repay the loan are assessed to ultimately decide whether to approve the loan application. One of the core components of underwriting is credit scoring, in which the probability of default is estimated. As such, there has been significant progress in enhancing the predictive accuracy of credit scoring models through the use of machine learning, but there still exists a need to ultimately construct an approval rule that takes into consideration additional criteria beyond the score itself. This construction process is traditionally done manually to ensure that the approval rule remains interpretable to humans. In this paper, we outline an automated system for optimizing a rule-based system for approving loan applications, which has been deployed at Hyundai Capital Services (HCS). The main challenge lay in creating a high-quality rule base that is simultaneously simple enough to be interpretable by risk analysts as well as customers, since the approval decision should be accountable. We addressed this challenge through principled submodular optimization. The deployment of our system has led to a 14% annual growth in the volume of loan services at HCS, while maintaining the target bad rate, and has resulted in the approval of customers who might have otherwise been rejected.
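
The abstract does not spell out the optimization routine, but the classic fit for this problem shape is greedy maximization of a monotone submodular value function under a cardinality constraint, which carries a (1 - 1/e) approximation guarantee. The sketch below is a generic illustration under that assumption; the rule objects and the value function are hypothetical placeholders, not HCS internals.

```python
def greedy_rule_selection(candidate_rules, value, k):
    """Greedy maximization of a monotone submodular set function `value`,
    selecting at most k rules to keep the rule base small and interpretable."""
    selected = set()
    for _ in range(k):
        remaining = [r for r in candidate_rules if r not in selected]
        if not remaining:
            break
        # Pick the candidate with the largest marginal gain.
        best = max(remaining, key=lambda r: value(selected | {r}) - value(selected))
        if value(selected | {best}) <= value(selected):
            break  # no remaining rule improves the objective
        selected.add(best)
    return selected

# Toy demo: each hypothetical "rule" approves a set of customer segments;
# the value of a rule set is total coverage, a classic monotone submodular
# function (actual objectives would balance approval volume against bad rate).
rules = {"r1": {1, 2, 3}, "r2": {3, 4}, "r3": {5}}

def coverage(selected):
    covered = set()
    for r in selected:
        covered |= rules[r]
    return len(covered)

print(greedy_rule_selection(rules, coverage, k=2))  # e.g. {'r1', 'r2'}
```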

Sungyoon Kim, Yunseon Choi, Daiki E. Matsunaga, Kee-Eung Kim: Stitching Sub-Trajectories with Conditional Diffusion Model for Goal-Conditioned Offline RL. AAAI Conference on Artificial Intelligence. 2024. [📄 Abstract] [✏️ Paper]

Offline Goal-Conditioned Reinforcement Learning (Offline GCRL) is an important problem in RL that focuses on acquiring diverse goal-oriented skills solely from pre-collected behavior datasets. In this setting, the reward feedback is typically absent except when the goal is achieved, which makes it difficult to learn policies especially from a finite dataset of suboptimal behaviors. In addition, realistic scenarios involve long-horizon planning, which necessitates the extraction of useful skills within sub-trajectories. Recently, the conditional diffusion model has been shown to be a promising approach to generate high-quality long-horizon plans for RL. However, their practicality for the goal-conditioned setting is still limited due to a number of technical assumptions made by the methods. In this paper, we propose SSD (Subtrajectory Stitching with Diffusion), a model-based Offline GCRL method that leverages the conditional diffusion model to address these limitations. Specifically, we use a diffusion model that generates future plans conditioned on the target goal and value, with the target value estimated from the goal-relabeled offline dataset. We report state-of-the-art performance on the standard benchmark set of GCRL tasks, and demonstrate the capability to successfully stitch the segments of suboptimal trajectories in the offline data to generate high-quality plans.

Jaeseok Yoon, Seunghyun Hwang, Kee-Eung Kim: A Study on the Potential of Speech-Based Dialogue Systems Leveraging Large Language Models. Korean Institute of Communications and Information Sciences (KICS). 2024. [📄 Abstract] [✏️ Paper]

Recently, large language models (LLMs) have shown outstanding performance in text-based dialogue systems, suggesting they may replace existing approaches. However, how LLMs can be used, and how well they perform, in scenarios involving speech data has not yet been sufficiently explored. Using a MultiWOZ dataset augmented with speech-related data together with automatic speech recognition (ASR) models, this study analyzes whether LLMs can track dialogue states from speech-based data via the dialogue state tracking (DST) task. We explore what role LLMs can play in spoken dialogue systems and evaluate their performance in combination with ASR models of various sizes, shedding light on their potential effectiveness and role in speech-based dialogue systems.

2023

Yunseon Choi, Seonghyun Ban, Kee-Eung Kim: A Reinforcement Learning Study on Interpretable Prompt Optimization. Korea Software Congress (KSC) 2023, Korean Institute of Information Scientists and Engineers. 2023. [📄 Abstract] [✏️ Paper]

Large language models have the potential to be applied effectively to a wide range of natural language tasks, but conventional fine-tuning, which updates the massive set of pre-trained parameters for a specific downstream task, requires a large amount of computing resources. Various prompting methods have therefore been proposed that, instead of updating all parameters, append a learned prompt to the input; however, they share a common limitation in that the learned prompts are uninterpretable or unfamiliar to human readers. To address this limitation, this paper proposes an interpretable prompt optimization algorithm that leverages human-friendly example prompts, building on behavioral cloning from reinforcement learning.

Mihye Kim, Jimyung Choi, Jaehyun Kim, Wooyoung Kim, Yeonung Baek, Gisuk Bang, Kwangwoon Son, Yeonman Ryou, Kee-Eung Kim: Trustworthy Residual Vehicle Value Prediction for Auto Finance. AI Magazine. 2023. [📄 Abstract] [✏️ Paper]

The residual value (RV) of a vehicle refers to its estimated worth at some point in the future. It is a core component in every auto finance product, used to determine the credit lines and the leasing rates. As such, an accurate prediction of RV is critical for the auto finance industry, since it can pose a risk of revenue loss by over-prediction or make the financial product uncompetitive through under-prediction. Although there are a number of prior studies on training machine learning models on a large amount of used car sales data, we had to cope with real-world operational requirements such as compliance with regulations (i.e., monotonicity of output with respect to a subset of features) and generalization to unseen input (i.e., new and rare car models). In this paper, we describe how we addressed these practical challenges and created value for our business at Hyundai Capital Services, the top auto financial service provider in Korea.

Haeju Lee*, Minchan Jeong*, Se-Young Yun, and Kee-Eung Kim: Bayesian Multi-Task Transfer Learning for Soft Prompt Tuning. Findings of Empirical Methods in Natural Language Processing (EMNLP). 2023. [📄 Abstract] [✏️ Paper]

Prompt tuning, in which prompts are optimized to adapt large-scale pre-trained language models to downstream tasks instead of fine-tuning the full model parameters, has been shown to be particularly effective when the prompts are trained in the multi-task transfer learning setting. These methods generally involve individually training prompts for each source task and then aggregating them to provide the initialization of the prompt for the target task. However, this approach critically ignores the fact that some of the source tasks could be negatively or positively interfering with each other. We argue that when we extract knowledge from source tasks via training source prompts, we need to consider this correlation among source tasks for better transfer to target tasks. To this end, we propose a Bayesian approach where we work with the posterior distribution of prompts across source tasks. We obtain representative source prompts corresponding to samples from the posterior using Stein Variational Gradient Descent, which are then aggregated to constitute the initial target prompt. We show extensive experimental results on the standard benchmark NLP tasks, where our Bayesian multi-task transfer learning approach outperforms the state-of-the-art methods in many settings. Furthermore, our approach requires no auxiliary models other than the prompt itself, achieving a high degree of parameter efficiency.
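
For reference, the Stein Variational Gradient Descent update used to obtain posterior samples moves a set of prompt particles $p_{1}, \dots, p_{n}$ jointly, with a kernel $k$ trading off attraction to the posterior $\pi$ against repulsion between particles:

$$ p_{i} \leftarrow p_{i} + \epsilon\, \hat{\phi}(p_{i}), \qquad \hat{\phi}(p) \;=\; \frac{1}{n} \sum_{j=1}^{n} \Big[ k(p_{j}, p)\, \nabla_{p_{j}} \log \pi(p_{j}) \;+\; \nabla_{p_{j}} k(p_{j}, p) \Big]. $$

This is the standard SVGD update; only its application to source-prompt particles is specific to the paper.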

Seokin Seo, HyeongJoo Hwang, Hongseok Yang, and Kee-Eung Kim: Regularized Behavior Cloning for Blocking the Leakage of Past Action Information. Advances in Neural Information Processing Systems (NeurIPS). 2023. Spotlight [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

For partially observable environments, imitation learning with observation histories (ILOH) assumes that control-relevant information is sufficiently captured in the observation histories for imitating the expert actions. In the offline setting where the agent is required to learn to imitate without interaction with the environment, behavior cloning (BC) has been shown to be a simple yet effective method for imitation learning. However, when the information about the actions executed in the past timesteps leaks into the observation histories, ILOH via BC often ends up imitating its own past actions. In this paper, we address this catastrophic failure by proposing a principled regularization for BC, which we name Past Action Leakage Regularization (PALR). The main idea behind our approach is to leverage the classical notion of conditional independence to mitigate the leakage. We compare different instances of our framework with natural choices of conditional independence metric and its estimator. The result of our comparison advocates the use of a particular kernel-based estimator for the conditional independence metric. We conduct an extensive set of experiments on benchmark datasets in order to assess the effectiveness of our regularization method. The experimental results show that our method significantly outperforms prior related approaches, highlighting its potential to successfully imitate expert actions when the past action information leaks into the observation histories.

Daiki E. Matsunaga*, Jongmin Lee*, Jaeseok Yoon, Stefanos Leonardos, Pieter Abbeel, and Kee-Eung Kim: AlberDICE: Addressing Out-Of-Distribution Joint Actions in Offline Multi-Agent RL via Alternating Stationary Distribution Correction Estimation. Advances in Neural Information Processing Systems (NeurIPS). 2023. [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

One of the main challenges in offline Reinforcement Learning (RL) is the distribution shift that arises from the learned policy deviating from the data collection policy. This is often addressed by avoiding out-of-distribution (OOD) actions during policy improvement, as their presence can lead to substantial performance degradation. This challenge is amplified in the offline Multi-Agent RL (MARL) setting since the joint action space grows exponentially with the number of agents. To avoid this curse of dimensionality, existing MARL methods adopt either value decomposition methods or fully decentralized training of individual agents. However, even when combined with standard conservatism principles, these methods can still result in the selection of OOD joint actions in offline MARL. To this end, we introduce AlberDICE, an offline MARL algorithm that alternately performs centralized training of individual agents based on stationary distribution optimization. AlberDICE circumvents the exponential complexity of MARL by computing the best response of one agent at a time while effectively avoiding OOD joint action selection. Theoretically, we show that the alternating optimization procedure converges to Nash policies. In the experiments, we demonstrate that AlberDICE significantly outperforms baseline algorithms on a standard suite of MARL benchmarks.

Jaeseok Yoon*, Seunghyun Hwang*, Ran Han, Jeonguk Bang, and Kee-Eung Kim: Adapting Text-based Dialogue State Tracker for Spoken Dialogues. Special Interest Group on Discourse and Dialogue (SIGDIAL) DSTC11 Workshop. 2023. Track best paper [📄 Abstract] [✏️ Paper]

Although there have been remarkable advances in dialogue systems through the dialogue systems technology competition (DSTC), it remains one of the key challenges to building a robust task-oriented dialogue system with a speech interface. Most of the progress has been made for text-based dialogue systems since there are abundant datasets with written corpora while those with spoken dialogues are very scarce. However, as can be seen from voice assistant systems such as Siri and Alexa, it is of practical importance to transfer the success to spoken dialogues. In this paper, we describe our engineering effort in building a highly successful model that participated in the speech-aware dialogue systems technology challenge track in DSTC11. Our model consists of three major modules: (1) automatic speech recognition error correction to bridge the gap between the spoken and the text utterances, (2) text-based dialogue system (D3ST) for estimating the slots and values using slot descriptions, and (3) post-processing for recovering the error of the estimated slot value. Our experiments show that it is important to use an explicit automatic speech recognition error correction module, post-processing, and data augmentation to adapt a text-based dialogue state tracker for spoken dialogue corpora.

HyeongJoo Hwang, Seokin Seo, Youngsoo Jang, Sungyoon Kim, Geon-Hyeong Kim, Seunghoon Hong, and Kee-Eung Kim: Information-Theoretic State Space Model for Multi-View Reinforcement Learning. Proceedings of International Conference on Machine Learning (ICML). 2023. Oral presentation [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

Multi-View Reinforcement Learning (MVRL) seeks to find an optimal control for an agent given multi-view observations from various sources. Despite recent advances in multi-view learning that aim to extract the latent representation from multi-view data, it is not straightforward to apply them to control tasks, especially when the observations are temporally dependent on one another. The problem can be even more challenging if the observations are intermittently missing for a subset of views. In this paper, we introduce Fuse2Control (F2C), an information-theoretic approach to capturing the underlying state space model from the sequences of multi-view observations. We conduct an extensive set of experiments in various control tasks showing that our method is highly effective in aggregating task-relevant information across many views, scaling linearly with the number of views while remaining robust to arbitrary missing-view scenarios.

Mihye Kim, Jimyung Choi, Jaehyun Kim, Wooyoung Kim, Yeonung Baek, Gisuk Bang, Kwangwoon Son, Yeonman Ryou, and Kee-Eung Kim: Trustworthy Residual Vehicle Value Prediction for Auto Finance. Proceedings of IAAI Technical Track on Deployed Highly Innovative Applications of AI. 2023. Innovative application award [📄 Abstract] [✏️ Paper]

The residual value (RV) of a vehicle refers to its estimated worth at some point in the future. It is a core component in every auto financial product, used to determine the credit lines and the leasing rates. As such, an accurate prediction of RV is critical for the auto finance industry, since it can pose a risk of revenue loss by over-prediction or make the financial product uncompetitive by under-prediction. Although there are a number of prior studies on training machine learning models on a large amount of used car sales data, we had to cope with real-world operational requirements such as compliance with regulations (i.e., monotonicity of output with respect to a subset of features) and generalization to unseen input (i.e., new and rare car models). In this paper, we describe how we coped with these practical challenges and created value for our business at Hyundai Capital Services, the top auto financial service provider in Korea.

2022

Seokin Seo, HyeongJoo Hwang, Hongseok Yang, and Kee-Eung Kim: Improving Causally Regularized Logistic Regression via Feature-Combination Confounder Balancing. Korea Software Congress (KSC) 2022, Korean Institute of Information Scientists and Engineers. 2022. [📄 Abstract] [✏️ Paper]

Stable learning aims at learning that is robust to distribution shift between training and test data. In this paper, we present a more general causal regularization method that subsumes Causally Regularized Logistic Regression (CRLR), an existing stable learning method for binary classification. The original algorithm treats each individual feature as the treatment variable and applies confounder balancing to the remaining features to learn sample weights; we extend this by treating a combination of features as the treatment variable and balancing the remaining features. We also show through a simple experiment that the proposed method is more effective than the original.

Geon-Hyeong Kim*, Jongmin Lee*, Youngsoo Jang, Hongseok Yang, and Kee-Eung Kim: LobsDICE: Offline Learning from Observation via Stationary Distribution Correction Estimation. Advances in Neural Information Processing Systems (NeurIPS). 2022. [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

We consider the problem of learning from observation (LfO), in which the agent aims to mimic the expert's behavior from state-only expert demonstrations. We additionally assume that the agent cannot interact with the environment but has access to the action-labeled transition data collected by some agents with unknown qualities. This offline setting for LfO is appealing in many real-world scenarios where the ground-truth expert actions are inaccessible and arbitrary environment interactions are costly or risky. In this paper, we present LobsDICE, an offline LfO algorithm that learns to imitate the expert policy via optimization in the space of stationary distributions. Our algorithm solves a single convex minimization problem, which minimizes the divergence between the two state-transition distributions induced by the expert and the agent policy. Through an extensive set of offline LfO tasks, we show that LobsDICE outperforms strong baseline methods.

Haanvid Lee, Jongmin Lee, Yunseon Choi, Wonseok Jeon, Byung-Jun Lee, Yung-Kyun Noh, and Kee-Eung Kim: Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions. Advances in Neural Information Processing Systems (NeurIPS). 2022. [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

We consider local kernel metric learning for off-policy evaluation (OPE) of deterministic policies in contextual bandits with continuous action spaces. Our work is motivated by practical scenarios where the target policy needs to be deterministic due to domain requirements, such as prescription of treatment dosage and duration in medicine. Although importance sampling (IS) provides a basic principle for OPE, it is ill-posed for the deterministic target policy with continuous actions. Our main idea is to relax the target policy and pose the problem as kernel-based estimation, where we learn the kernel metric in order to minimize the overall mean squared error (MSE). We present an analytic solution for the optimal metric, based on the analysis of bias and variance. Whereas prior work has been limited to scalar action spaces or kernel bandwidth optimization, our work takes a step further, handling vector action spaces and metric optimization. We show that our estimator is consistent, and significantly reduces the MSE compared to baseline OPE methods through experiments on various domains.

Sanghoon Myung, In Huh, Wonik Jang, Jae Myung Choe, Jisu Ryu, Daesin Kim, Kee-Eung Kim, and Changwook Jeong: PAC-Net: A Model Pruning Approach to Inductive Transfer Learning. Proceedings of International Conference on Machine Learning (ICML). 2022. [📄 Abstract] [✏️ Paper]

Inductive transfer learning aims to learn from a small amount of training data for the target task by utilizing a pre-trained model from the source task. Most strategies that involve large-scale deep learning models adopt initialization with the pre-trained model and fine-tuning for the target task. However, when using over-parameterized models, we can often prune the model without sacrificing the accuracy of the source task. This motivates us to adopt model pruning for transfer learning with deep learning models. In this paper, we propose PAC-Net, a simple yet effective approach for transfer learning based on pruning. PAC-Net consists of three steps: Prune, Allocate, and Calibrate (PAC). The main idea behind these steps is to identify essential weights for the source task, fine-tune on the source task by updating the essential weights, and then calibrate on the target task by updating the remaining redundant weights. Across a varied and extensive set of inductive transfer learning experiments, we show that our method achieves state-of-the-art performance by a large margin.
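
A minimal PyTorch-flavored sketch of the Prune-Allocate-Calibrate recipe as we read it from the abstract; the magnitude-pruning criterion and masked updates are our assumptions, not the released implementation.

```python
import torch

def magnitude_mask(param, keep_ratio=0.5):
    # Prune: mark the largest-magnitude weights as "essential" for the
    # source task (assumes 0 < keep_ratio < 1).
    k = int(param.numel() * keep_ratio)
    threshold = param.abs().flatten().kthvalue(param.numel() - k).values
    return (param.abs() > threshold).float()

def masked_sgd_step(model, masks, lr, train_essential):
    # Allocate: fine-tune on the source task updating only essential weights
    # (train_essential=True). Calibrate: train on the target task updating
    # only the remaining redundant weights (train_essential=False), so the
    # source knowledge held in essential weights is preserved.
    # Assumes gradients were already populated by loss.backward().
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            m = masks[name] if train_essential else 1.0 - masks[name]
            p -= lr * p.grad * m
```

Here `masks` would be computed once after source pre-training, e.g. `masks = {name: magnitude_mask(p) for name, p in model.named_parameters()}`.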

Jinhyeon Kim and Kee-Eung Kim: Data Augmentation for Learning to Play in Text-Based Games. Proceedings of International Joint Conference on Artificial Intelligence (IJCAI). 2022. [📄 Abstract] [✏️ Paper]

Improving generalization in text-based games serves as a useful stepping-stone towards reinforcement learning (RL) agents with generic linguistic ability. Data augmentation for generalization in RL has been shown to be very successful in classic control and visual tasks, but there is no prior work for text-based games. We propose Transition-Matching Permutation, a novel data augmentation technique for text-based games, where we identify phrase permutations that match as many transitions as possible in the trajectory data. We show that applying this technique results in state-of-the-art performance in the Cooking Game benchmark suite for text-based games.

Haeju Lee*, Oh Joon Kwon*, Yunseon Choi*, Minho Park, Ran Han, Yoonhyung Kim, Jinhyeon Kim, Youngjune Lee, Haebin Shin, Kangwook Lee, and Kee-Eung Kim: Learning to Embed Multi-Modal Contexts for Situated Conversational Agents. Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) Findings. 2022. [📄 Abstract] [✏️ Paper]

The Situated Interactive Multi-Modal Conversations (SIMMC) 2.0 aims to create virtual shopping assistants that can accept complex multi-modal inputs, i.e. visual appearances of objects and user utterances. It consists of four subtasks, multi-modal disambiguation (MM-Disamb), multi-modal coreference resolution (MM-Coref), multi-modal dialog state tracking (MM-DST), and response retrieval and generation. While many task-oriented dialog systems usually tackle each subtask separately, we propose a jointly learned multi-modal encoder-decoder that incorporates visual inputs and performs all four subtasks at once for efficiency. This approach won the MM-Coref and response retrieval subtasks and was nominated runner-up for the remaining subtasks using a single unified model at the 10th Dialog Systems Technology Challenge (DSTC10), setting a high bar for the novel task of multi-modal task-oriented dialog systems.

Haeju Lee*, Oh Joon Kwon*, Yunseon Choi*, Jinhyeon Kim, Youngjune Lee, Ran Han, Yoonhyung Kim, Minho Park, Kangwook Lee, Haebin Shin, and Kee-Eung Kim: Tackling Situated Multi-Modal Task-Oriented Dialogs with a Single Transformer Model. AAAI Conference on Artificial Intelligence (AAAI) DSTC10 Workshop. 2022. Track best paper [📄 Abstract] [✏️ Paper]

The Situated Interactive Multi-Modal Conversations (SIMMC) 2.0 track in the Dialog System Technology Challenge 10 (DSTC10) aims to create virtual shopping assistants that can accept complex multi-modal inputs, i.e. visual appearances of objects and user utterances. It consists of four subtasks, multi-modal disambiguation (MM-Disamb), multi-modal coreference resolution (MM-Coref), multi-modal dialog state tracking (MM-DST), and response retrieval and generation. While many task-oriented dialog systems usually tackle each subtask separately, we propose a jointly learned encoder-decoder that performs all four subtasks at once for efficiency. Moreover, we handle the multi-modality of the challenge by representing visual objects as special tokens whose joint embedding is learned via auxiliary tasks. Finally, we won the MM-Coref and response retrieval subtasks and were nominated runner-up for the remaining subtasks using a single unified model. In particular, our model achieved 81.5% MRR, 71.2% R@1, 95.0% R@5, 98.2% R@10, and 1.9 mean rank in the response retrieval task, along with competitive results in all subtasks, setting a high bar for the state-of-the-art result in SIMMC 2.0.

Sunghoon Hong, Deunsol Yoon, and Kee-Eung Kim: Structure-Aware Transformer Policy for Inhomogeneous Multi-Task Reinforcement Learning. International Conference on Learning Representations (ICLR). 2022. [📄 Abstract] [✏️ Paper]

Modular Reinforcement Learning, where the agent is assumed to be morphologically structured as a graph, for example composed of limbs and joints, aims to learn a policy that is transferable to a structurally similar but different agent. Compared to traditional Multi-Task Reinforcement Learning, this promising approach allows us to cope with inhomogeneous tasks where the state and action space dimensions differ across tasks. Graph Neural Networks are a natural model for representing the pertinent policies, but recent work has shown that their multi-hop message passing mechanism is not ideal for conveying important information to other modules, and thus a transformer model without morphological information was proposed. In this work, we argue that the morphological information is still very useful and propose a transformer policy model that effectively encodes such information. Specifically, we encode the morphological information in terms of the traversal-based positional embedding and the graph-based relational embedding. We empirically show that the morphological information is crucial for modular reinforcement learning, substantially outperforming prior state-of-the-art methods on multi-task learning as well as transfer learning settings with different state and action space dimensions.

Youngsoo Jang, Jongmin Lee, and Kee-Eung Kim: GPT-Critic: Offline Reinforcement Learning for End-to-End Task-Oriented Dialogue Systems. International Conference on Learning Representations (ICLR). 2022. [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

Training a task-oriented dialogue agent can be naturally formulated as an offline reinforcement learning (RL) problem, where the agent aims to learn a conversational strategy to achieve user goals, only from a dialogue corpus. It is very challenging in terms of RL since the natural language action space is astronomical, while feasible (syntactically and semantically correct) actions are very sparse. Thus, standard RL methods easily fail and generate responses diverging from human language, even when fine-tuning a powerful pre-trained language model. In this paper, we introduce GPT-Critic, an offline RL method for task-oriented dialogue. GPT-Critic is built upon GPT-2, fine-tuning the language model through behavior cloning of the critic-guided self-generated sentences. GPT-Critic is essentially free from the issue of diverging from human language since it learns from the sentences sampled from the pre-trained language model. In the experiments, we demonstrate that our algorithm outperforms the state-of-the-art in the task-oriented dialogue benchmarks including MultiWOZ 2.0 and ConvLab.

Geon-Hyeong Kim, Seokin Seo, Jongmin Lee, Wonseok Jeon, HyeongJoo Hwang, Hongseok Yang, and Kee-Eung Kim: DemoDICE: Offline Imitation Learning with Supplementary Imperfect Demonstrations. International Conference on Learning Representations (ICLR). 2022. [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

We consider offline imitation learning (IL), which aims to mimic the expert's behavior from its demonstration without further interaction with the environment. One of the main challenges in offline IL is to deal with the narrow support of the data distribution exhibited by the expert demonstrations that cover only a small fraction of the state and the action spaces. As a result, offline IL algorithms that rely only on expert demonstrations are very unstable since the situation easily deviates from those in the expert demonstrations. In this paper, we assume additional demonstration data of unknown degrees of optimality, which we call imperfect demonstrations. Compared with the recent IL algorithms that adopt adversarial minimax training objectives, we substantially stabilize the overall learning process by reducing minimax optimization to direct convex optimization in a principled manner. Through an extensive set of tasks, we show that DemoDICE achieves promising results in offline IL from expert and imperfect demonstrations.

Jongmin Lee, Cosmin Paduraru, Daniel J. Mankowitz, Nicolas Heess, Doina Precup, Kee-Eung Kim, and Arthur Guez: COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation. International Conference on Learning Representations (ICLR). 2022. Spotlight [📄 Abstract] [✏️ Paper]

We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset. This problem setting is appealing in many real-world scenarios, where direct interaction with the environment is costly or risky, and where the resulting policy should comply with safety constraints. However, it is challenging to compute a policy that guarantees satisfying the cost constraints in the offline RL setting, since the off-policy evaluation inherently has an estimation error. In this paper, we present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution. Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction. Experimental results show that COptiDICE attains better policies in terms of constraint satisfaction and return-maximization, outperforming baseline algorithms.
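
In schematic form (see the paper for the exact estimator and its derivation), the problem COptiDICE solves lives in the space of stationary distributions $d$ rather than policies:

$$ \max_{d \,\geq\, 0}\;\; \mathbb{E}_{(s,a) \sim d}\big[ r(s,a) \big] \;-\; \alpha\, D_{f}\big( d \,\|\, d^{D} \big) \quad \text{s.t.} \quad \mathbb{E}_{(s,a) \sim d}\big[ c(s,a) \big] \;\leq\; \hat{c}, $$

$$ \sum_{a} d(s,a) \;=\; (1-\gamma)\, p_{0}(s) \;+\; \gamma \sum_{s', a'} P(s \mid s', a')\, d(s', a') \quad \forall s, $$

where $d^{D}$ is the dataset distribution, the $f$-divergence penalty keeps the policy close to the data, and the Bellman flow constraint ensures $d$ is realizable by some policy.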

2021

HyeongJoo Hwang, Geon-Hyeong Kim, Seunghoon Hong, and Kee-Eung Kim: Multi-View Representation Learning via Total Correlation Objective. Advances in Neural Information Processing Systems (NeurIPS). 2021. [📄 Abstract] [✏️ Paper]

Multi-View Representation Learning (MVRL) aims to discover a shared representation of observations from different views with the complex underlying correlation. In this paper, we propose a variational approach which casts MVRL as maximizing the amount of total correlation reduced by the representation, aiming to learn a shared latent representation that is informative yet succinct to capture the correlation among multiple views. To this end, we introduce a tractable surrogate objective function under the proposed framework, which allows our method to fuse and calibrate the observations in the representation space. From the information theoretic perspective, we show that our framework subsumes existing multi-view generative models. Lastly, we show that our approach straightforwardly extends to the Partial MVRL (PMVRL) setting, where the observations are missing without any regular pattern. We demonstrate the effectiveness of our approach in the multi-view translation and classification tasks, outperforming strong baseline methods.

Geon-Hyeong Kim, Youngsoo Jang, Jongmin Lee, and Kee-Eung Kim: A Study on Efficient Multi-Task Offline Model-Based Reinforcement Learning Algorithms. Korea Software Congress, Korean Institute of Information Scientists and Engineers. 2021. [📄 Abstract] [✏️ Paper]

Offline reinforcement learning aims to learn a policy from pre-collected data without additional interaction with the environment. This framework, however, presupposes that a large amount of data has been collected in advance for a single task; to relax this constraint, one can consider the multi-task offline reinforcement learning problem, which exploits data collected across a variety of tasks. In this paper, we propose a model-based reinforcement learning algorithm that efficiently leverages data from other tasks in this multi-task offline setting. The proposed algorithm, MT-OMBRL, shares dynamics information across tasks and outperforms offline reinforcement learning algorithms that solve each task independently.

Youngjune Lee, Oh Joon Kwon, Haeju Lee, Joonyoung Kim, Kangwook Lee, and Kee-Eung Kim: Augment & Valuate : A Data Enhancement Pipeline for Data-Centric AI. Neural Information Processing Systems (NeurIPS) Data-Centric AI workshop. 2021. Honorable mention [📄 Abstract] [✏️ Paper]

Data scarcity and noise are important issues in industrial applications of machine learning. However, it is often challenging to devise a scalable and generalized approach to address the fundamental distributional and semantic properties of a dataset with black-box models. For this reason, data-centric approaches are crucial for the automation of the machine learning operations pipeline. In order to serve as the basis for this automation, we suggest a domain-agnostic pipeline for refining the quality of data in image classification problems. This pipeline contains data valuation, cleansing, and augmentation. With an appropriate combination of these methods, we achieved 84.711% test accuracy (ranked #6, Honorable Mention in the Most Innovative category) in the Data-Centric AI competition using only the provided dataset.

Youngjune Lee and Kee-Eung Kim: Dual Correction Strategy for Ranking Distillation in Top-N Recommender System. Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM). 2021. [📄 Abstract] [✏️ Paper]

Knowledge Distillation (KD), which transfers the knowledge of a well-trained large model (teacher) to a small model (student), has become an important area of research for practical deployment of recommender systems. Recently, Relaxed Ranking Distillation (RRD) has shown that distilling the ranking information in the recommendation list significantly improves the performance. However, the method still has limitations in that 1) it does not fully utilize the prediction errors of the student model, which makes the training not fully efficient, and 2) it only distills the user-side ranking information, which provides an insufficient view under the sparse implicit feedback. This paper presents Dual Correction strategy for Distillation (DCD), which transfers the ranking information from the teacher model to the student model in a more efficient manner. Most importantly, DCD uses the discrepancy between the teacher model and the student model predictions to decide which knowledge to distill. By doing so, DCD essentially provides the learning guidance tailored to "correcting" what the student model has failed to accurately predict. This process is applied for transferring the ranking information from the user-side as well as the item-side to address sparse implicit user feedback. Our experiments show that the proposed method outperforms the state-of-the-art baselines, and ablation studies validate the effectiveness of each component.

Jongmin Lee*, Wonseok Jeon*, Byung-Jun Lee, Joelle Pineau, and Kee-Eung Kim: OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation. Proceedings of the International Conference on Machine Learning (ICML). 2021. [📄 Abstract] [✏️ Paper]

We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, tightly integrates the optimization of the target policy and the stationary distribution ratio estimation of the target policy and the behavior policy. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods.
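
The quantity at the heart of the method is the stationary distribution correction, from which the optimized policy is recovered directly (a schematic, tabular-form summary; see the paper for the precise objective):

$$ w^{*}(s,a) \;=\; \frac{d^{\pi^{*}}(s,a)}{d^{D}(s,a)}, \qquad \pi^{*}(a \mid s) \;\propto\; d^{D}(s,a)\, w^{*}(s,a). $$

Because $w^{*}$ is estimated from the data without bootstrapped action-value targets, the overestimation that arises from bootstrapping is avoided by construction.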

Jongmin Lee*, Wonseok Jeon*, Byung-Jun Lee, Joelle Pineau, and Kee-Eung Kim: OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation. A Roadmap to Never-Ending RL Workshop at ICLR. 2021. [📄 Abstract] [✏️ Paper]

We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, tightly integrates the optimization of the target policy and the stationary distribution ratio estimation of the target policy and the behavior policy. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods.

Jinhyeon Kim, Donghoon Ham, Jeong-Gwan Lee, and Kee-Eung Kim: End-to-End Document-Grounded Conversation with Encoder-Decoder Pre-Trained Language Model. AAAI Conference on Artificial Intelligence (AAAI) DSTC9 Workshop. 2021. [📄 Abstract] [✏️ Paper]

The first track of the Ninth Dialog System Technology Challenge (DSTC9), “Beyond Domain APIs: Task-Oriented Conversational Modeling with Unstructured Knowledge Access,” encourages the participants to build goal-oriented dialog systems with access to unstructured knowledge, thereby making it possible to handle diverse user inquiries outside the scope of API/DBs. It consists of three sub-tasks: knowledge-seeking turn detection, knowledge selection, and knowledge-grounded response generation. We claim that tackling these sub-tasks separately is neither parameter-efficient nor of better performance. In this paper, we present an end-to-end document-grounded conversation system that utilizes a pre-trained language model with an encoder-decoder structure. In the human evaluation, our dialog system achieved the accuracy score of 4.3082 and the appropriateness score of 4.2665, which ranked 9th out of 24 participant teams. Furthermore, we conduct an ablation study and show that the end-to-end encoder-decoder scheme enables more efficient use of parameters in the document-grounded conversation setting.

Deunsol Yoon*, Sunghoon Hong*, Byung-Jun Lee, and Kee-Eung Kim: Winning the L2RPN Challenge: Power Grid Management via Semi-Markov Afterstate Actor-Critic. International Conference on Learning Representations (ICLR). 2021. Spotlight [📄 Abstract] [✏️ Paper]

Safe and reliable electricity transmission in power grids is crucial for modern society. It is thus quite natural that there has been a growing interest in the automatic management of power grids, exemplified by the Learning to Run a Power Network Challenge (L2RPN), modeling the problem as a reinforcement learning (RL) task. However, it is highly challenging to manage a real-world scale power grid, mostly due to the massive scale of its state and action space. In this paper, we present an off-policy actor-critic approach that effectively tackles the unique challenges in power grid management by RL, adopting the hierarchical policy together with the afterstate representation. Our agent ranked first in the latest challenge (L2RPN WCCI 2020), being able to avoid disastrous situations while maintaining the highest level of operational efficiency in every test scenario. This paper provides a formal description of the algorithmic aspect of our approach, as well as further experimental studies on diverse power grids.

Youngsoo Jang, Seokin Seo, Jongmin Lee, and Kee-Eung Kim: Monte-Carlo Planning and Learning with Language Action Value Estimates. International Conference on Learning Representations (ICLR). 2021. [📄 Abstract] [✏️ Paper]

Interactive Fiction (IF) games provide a useful testbed for language-based reinforcement learning agents, posing significant challenges of natural language understanding, commonsense reasoning, and non-myopic planning in the combinatorial search space. Agents based on standard planning algorithms struggle to play IF games due to the massive search space of language actions. Thus, language-grounded planning is a key ability of such agents, since inferring the consequence of language action based on semantic understanding can drastically improve search. In this paper, we introduce Monte-Carlo planning with Language Action Value Estimates (MC-LAVE), which combines Monte-Carlo tree search with language-driven exploration. MC-LAVE invests more search effort into semantically promising language actions using locally optimistic language value estimates, yielding a significant reduction in the effective search space of language actions. We then present a reinforcement learning approach via MC-LAVE, which alternates between MC-LAVE planning and supervised learning of the self-generated language actions. In the experiments, we demonstrate that our method achieves new high scores in various IF games.

Byung-Jun Lee, Jongmin Lee, and Kee-Eung Kim: Representation Balancing Offline Model-based Reinforcement Learning. International Conference on Learning Representations (ICLR). 2021. [📄 Abstract] [✏️ Paper]

One of the main challenges in offline and off-policy reinforcement learning is to cope with the distribution shift that arises from the mismatch between the target policy and the data collection policy. In this paper, we focus on a model-based approach, particularly on learning the representation for a robust model of the environment under the distribution shift, which was first studied in Representation Balancing MDP (RepBM). Although this prior work has shown promising results, there are a number of shortcomings that still hinder its applicability to practical tasks. In particular, we address the curse of horizon exhibited by RepBM, which rejects most of the pre-collected data in long-term tasks. We present a new objective for model learning motivated by recent advances in the estimation of stationary distribution corrections. This effectively overcomes the aforementioned limitation of RepBM, as well as naturally extending to continuous action spaces and stochastic policies. We also present an offline model-based policy optimization using this new objective, yielding the state-of-the-art performance in a representative set of benchmark offline RL tasks.

2020

Byung-Jun Lee, Jongmin Lee, Yunseon Choi, Youngsoo Jang, and Kee-Eung Kim: A Study on Applying the Efficient Lifelong Learning Algorithm to Model-Based Reinforcement Learning. Korea Software Congress, Korean Institute of Information Scientists and Engineers. 2020. [📄 Abstract] [✏️ Paper]

The lifelong learning problem, in which multiple different tasks are learned in succession, is of central importance to research on general-purpose AI agents. This paper aims to apply the Efficient Lifelong Learning Algorithm (ELLA), a well-known lifelong learning algorithm from supervised learning, to model-based reinforcement learning. The proposed algorithm, MB-ELRL, has access to only one task at a time, yet efficiently learns the shareable information among the tasks' dynamics, performing far better than learning each task independently.

HyeongJoo Hwang, Geon-Hyeong Kim, Seunghoon Hong, and Kee-Eung Kim: Variational Interaction Information Maximization for Cross-domain Disentanglement. Advances in Neural Information Processing Systems (NeurIPS). 2020. [📄 Abstract] [✏️ Paper]

Cross-domain disentanglement is the problem of learning representations partitioned into domain-invariant and domain-specific representations, which is a key to successful domain transfer or measuring semantic distance between two domains. Grounded in information theory, we cast the simultaneous learning of domain-invariant and domain-specific representations as a joint objective of multiple information constraints, which does not require adversarial training or gradient reversal layers. We derive a tractable bound of the objective and propose a generative model named Interaction Information Auto-Encoder (IIAE). Our approach reveals insights on the desirable representation for cross-domain disentanglement and its connection to Variational Auto-Encoder (VAE). We demonstrate the validity of our model in learning disentangled representations with the image-to-image translation and the cross-domain retrieval tasks. We further show that our model achieves the state-of-the-art performance in the zero-shot sketch based image retrieval task, even without external knowledge.

Jongmin Lee, Byung-Jun Lee, and Kee-Eung Kim: Reinforcement Learning for Control with Multiple Frequencies. Advances in Neural Information Processing Systems (NeurIPS). 2020. [📄 Abstract] [✏️ Paper]

Many real-world sequential decision problems involve multiple action variables whose control frequencies are different, such that actions take their effects at different periods. While these problems can be formulated with the notion of multiple action persistences in factored-action MDP (FA-MDP), it is non-trivial to solve them efficiently since an action-persistent policy constructed from a stationary policy can be arbitrarily suboptimal, rendering solution methods for the standard FA-MDPs hardly applicable. In this paper, we formalize the problem of multiple control frequencies in RL and provide its efficient solution method. Our proposed method, Action-Persistent Policy Iteration (AP-PI), provides a theoretical guarantee on the convergence to an optimal solution while incurring only a factor of $|\mathcal{A}|$ increase in time complexity during the policy improvement step, compared to the standard policy iteration for FA-MDPs. Extending this result, we present Action-Persistent Actor-Critic (AP-AC), a scalable RL algorithm for high-dimensional control tasks. In the experiments, we demonstrate that AP-AC significantly outperforms the baselines on several continuous control tasks and a traffic control simulation, which highlights the effectiveness of our method that directly optimizes the periodic non-stationary policy for tasks with multiple control frequencies.

Geon-Hyeong Kim, Youngsoo Jang, Hongseok Yang, and Kee-Eung Kim: Variational Inference for Sequential Data with Future Likelihood Estimates. Proceedings of the International Conference on Machine Learning (ICML). 2020. [📄 Abstract] [✏️ Paper]

The recent development of flexible and scalable variational inference algorithms has popularized the use of deep probabilistic models in a wide range of applications. However, learning and reasoning about high-dimensional models with non-differentiable densities are still a challenge. For such a model, inference algorithms struggle to estimate the gradients of variational objectives accurately, due to high variance in their estimates. To tackle this challenge, we present a novel variational inference algorithm for sequential data, which performs well even when the density from the model is not differentiable, for instance, due to the use of discrete random variables. The key feature of our algorithm is that it estimates future likelihoods at all time steps. The estimated future likelihoods form the core of our new low-variance gradient estimator. We formally analyze our gradient estimator from the perspective of variational objective, and show the effectiveness of our algorithm with synthetic and real datasets.

Byung-Jun Lee*, Jongmin Lee*, Peter Vrancx, Dongho Kim, and Kee-Eung Kim: Batch Reinforcement Learning with Hyperparameter Gradients. Proceedings of the International Conference on Machine Learning (ICML). 2020. [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

We consider the batch reinforcement learning problem where the agent needs to learn only from a fixed batch of data, without further interaction with the environment. In such a scenario, we want to prevent the optimized policy from deviating too much from the data collection policy since the estimation becomes highly unstable otherwise due to the off-policy nature of the problem. However, imposing this requirement too strongly will result in a policy that merely follows the data collection policy. Unlike prior work where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), that uses a gradient-based optimization of the hyperparameter using held-out data. We show that BOPAH outperforms other batch reinforcement learning algorithms in tabular and continuous control tasks, by finding a good balance in the trade-off between adhering to the data collection policy and pursuing possible policy improvement.

Donghoon Ham*, Jeong-Gwan Lee*, Youngsoo Jang, and Kee-Eung Kim: End-to-End Neural Pipeline for Goal-Oriented Dialogue System using GPT-2. Annual Conference of the Association for Computational Linguistics (ACL). 2020. [📄 Abstract] [✏️ Paper]

A goal-oriented dialogue system needs to be optimized for tracking the dialogue flow and carrying out an effective conversation under various situations to meet the user's goal. The traditional approach to building such a dialogue system is to take a pipelined modular architecture, where its modules are optimized individually. However, such an optimization scheme does not necessarily yield an overall performance improvement of the whole system. On the other hand, end-to-end dialogue systems with a monolithic neural architecture are often trained only with input-output utterances, without taking into account the annotations available in the corpus. This makes them ill-suited for goal-oriented dialogues, where the system needs to be integrated with external systems or to provide interpretable information about why it generated a particular response. In this paper, we present an end-to-end neural architecture for dialogue systems that addresses both challenges. In the human evaluation, our dialogue system achieved a success rate of 68.32%, a language understanding score of 4.149, and a response appropriateness score of 4.287, ranking first in the end-to-end multi-domain dialogue system task of the 8th Dialog System Technology Challenge (DSTC8).

Donghoon Ham*, Jeong-Gwan Lee*, Youngsoo Jang, and Kee-Eung Kim: End-to-End Neural Pipeline for Goal-Oriented Dialogue System using GPT-2. AAAI Conference on Artificial Intelligence (AAAI) DSTC8 Workshop. 2020. [📄 Abstract] [✏️ Paper]

The first sub-task in the multi-domain task-completion dialogue challenge track of the 8th Dialog System Technology Challenge (DSTC8) requires participants to build an end-to-end dialogue system capable of complex multi-domain dialogues. The traditional approach to building such a dialogue system is to take a pipelined architecture, where its modular components are optimized individually. However, such an optimization scheme does not necessarily yield an overall performance improvement of the whole system. On the other hand, most end-to-end dialogue systems with a monolithic neural architecture are trained only with input-output utterances, without taking into account the annotations available in the corpus. This makes them ill-suited for goal-oriented dialogues, where the system needs to interact with external systems such as database engines or to provide interpretable information about why it decided to generate a particular response. In this paper, we present an end-to-end neural architecture for dialogue systems that addresses both challenges. In the official human evaluation, our dialogue system achieved a success rate of 68.32%, a language understanding score of 4.149, and a response appropriateness score of 4.287, ranking first on all performance evaluation criteria.

Byung-Jun Lee, Seunghoon Hong, and Kee-Eung Kim: Residual Neural Processes. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). 2020. [📄 Abstract] [✏️ Paper]

A Neural Process (NP) is a map from a set of observed input-output pairs to a predictive distribution over functions, designed to mimic the inference mechanisms of other stochastic processes. NPs are shown to work effectively in tasks that require complex distributions, where traditional stochastic processes struggle, e.g., image completion. This paper concerns the practical capacity of set function approximators despite their universality. By delving deeper into the relationship between an NP and a Bayesian last layer (BLL), we show that NPs may struggle in simple examples that other stochastic processes can easily solve. We propose a simple yet effective remedy: the Residual Neural Process (RNP), which leverages a traditional BLL for faster training and better prediction. We demonstrate that the RNP shows faster convergence and better performance, both qualitatively and quantitatively.

Youngsoo Jang, Jongmin Lee, and Kee-Eung Kim: Bayes-Adaptive Monte-Carlo Planning and Learning for Goal-Oriented Dialogues. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). 2020. [📄 Abstract] [✏️ Paper]

We consider a strategic dialogue task, where the ability to infer the other agent's goal is critical to the success of the conversational agent. While this problem can be naturally formulated as Bayesian planning, it is known to be a very difficult problem due to its enormous search space consisting of all possible utterances. In this paper, we introduce an efficient Bayes-adaptive planning algorithm for goal-oriented dialogues, which combines RNN-based dialogue generation and MCTS-based Bayesian planning in a novel way, leading to robust decision-making under the uncertainty of the other agent's goal. We then introduce reinforcement learning for the dialogue agent that uses MCTS as a strong policy improvement operator, casting reinforcement learning as iterative alternation of planning and supervised-learning of self-generated dialogues. In the experiments, we demonstrate that our Bayes-adaptive dialogue planning agent significantly outperforms the state-of-the-art in a negotiation dialogue domain. We also show that reinforcement learning via MCTS further improves end-task performance without diverging from human language.

Jongmin Lee, Wonseok Jeon, Geon-Hyeong Kim, and Kee-Eung Kim: Monte-Carlo Tree Search in Continuous Action Spaces with Value Gradients. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). 2020. [📄 Abstract] [✏️ Paper]

Monte-Carlo Tree Search (MCTS) is the state-of-the-art online planning algorithm for large problems with discrete action spaces. However, many real-world problems involve continuous action spaces, where MCTS is not as effective as in discrete action spaces. This is mainly due to common practices such as coarse discretization of the entire action space and failure to exploit local smoothness. In this paper, we introduce Value-Gradient UCT (VG-UCT), which combines traditional MCTS with gradient-based optimization of action particles. VG-UCT simultaneously performs a global search via UCT with respect to the finitely sampled set of actions and performs a local improvement via action value gradients. In the experiments, we demonstrate that our approach outperforms existing MCTS methods and other strong baseline algorithms for continuous action spaces.
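
The gist of combining global UCT search with local value gradients can be sketched on a 1-D toy bandit, as below. The analytic gradient stands in for action-value gradients obtained from a differentiable model, and letting a moved particle keep its stale running-mean value is a deliberate simplification not present in the actual algorithm.

```python
# Toy sketch of the VG-UCT idea (not the authors' code): treat sampled
# action "particles" as arms of a UCB search, and locally refine the chosen
# particle along the gradient of the (estimated) action value.
import math, random

random.seed(0)
f      = lambda a: -(a - 0.7) ** 2        # true action value, optimum at 0.7
grad_f = lambda a: -2.0 * (a - 0.7)       # assumed-available value gradient

particles = [random.uniform(-1, 1) for _ in range(4)]  # coarse global sample
n, q = [0] * 4, [0.0] * 4
for t in range(1, 401):
    # global search: UCB over the current action particles
    ucb = [q[i] + (2 * math.log(t) / n[i]) ** 0.5 if n[i] else float("inf")
           for i in range(4)]
    i = max(range(4), key=lambda k: ucb[k])
    reward = f(particles[i]) + random.gauss(0, 0.05)
    n[i] += 1
    q[i] += (reward - q[i]) / n[i]                     # running mean value
    # local improvement: nudge the chosen particle along the value gradient
    particles[i] += 0.05 * grad_f(particles[i])

best = max(range(4), key=lambda k: q[k])
print("best particle %.3f (optimum 0.7)" % particles[best])
```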
2019

김건형, 장영수, 이종민, and 김기응: Variance Reduction Methods for Monte Carlo Objectives. KICS Summer Conference. 2019. [📄 Abstract] [✏️ Paper]

This paper studies methods for efficiently training deep generative models that use discrete latent variables. We present a correspondence with reinforcement learning when training deep generative models under various training objectives. This correspondence makes the various variance reduction methods from reinforcement learning applicable, which we use to improve performance.

이종민, 김건형, and 김기응: A Study on Monte-Carlo Tree Search in Continuous Action Spaces. KICS Summer Conference. 2019. [📄 Abstract] [✏️ Paper]

Monte-Carlo Tree Search (MCTS) is an online planning algorithm that has achieved great success on a variety of problems with discrete action spaces, but it has not been the method of first choice in continuous action spaces, because making tree search feasible requires coarsely discretizing the action space. In this paper, we address a method that combines the global search of UCT with local search that exploits derivative information from the environment in continuous action spaces. Benchmark results on continuous-action control problems show that the proposed method outperforms a variety of baseline algorithms.

Nianyin Zeng, Zidong Wang, Hong Zhang, Kee-Eung Kim, Yurong Li, and Xiaohui Liu: An Improved Particle Filter With a Novel Hybrid Proposal Distribution for Quantitative Analysis of Gold Immunochromatographic Strips. IEEE Transactions on Nanotechnology, 18:819-829. 2019. [📄 Abstract] [✏️ Paper]

In this paper, a novel statistical pattern recognition method is proposed for accurately segmenting test and control lines from gold immunochromatographic strip (GICS) images for the benefit of quantitative analysis. A new dynamic state-space model is established, based on which the segmentation task of test and control lines is transformed into a state estimation problem. In particular, the transition equation is utilized to describe the relationship between contour points on the upper and lower boundaries of test and control lines, and a new observation equation is developed by combining the contrast of the between-class variance and the uniformity measure. Then, an innovative particle filter (PF) with a hybrid proposal distribution, namely the deep-belief-network-based particle filter (DBN-PF), is put forward, where the deep belief network (DBN) provides an initial recognition result in the hybrid proposal distribution, and the particle swarm optimization algorithm moves particles to regions of high likelihood. The performance of the proposed DBN-PF method is comprehensively evaluated not only on an artificial dataset but also on GICS images in terms of several indices, as compared to the PF and DBN methods. Experimental results demonstrate that the proposed approach is effective in the quantitative analysis of GICS.
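
A generic 1-D particle filter with a hybrid proposal, sketched below, conveys the main mechanism: part of the proposal is the transition prior and part is an observation-driven "recognition" component (standing in for the DBN's initial result; the PSO move step is elided), with importance weights corrected for the mixture proposal. The model is a made-up random walk, not the GICS state-space model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 500
sig_x, sig_y, sig_q = 1.0, 0.5, 0.5

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# simulate: x_t = x_{t-1} + noise, y_t = x_t + noise
x_true = np.cumsum(rng.normal(0, sig_x, T))
y = x_true + rng.normal(0, sig_y, T)

particles = np.zeros(N)
est = []
for t in range(T):
    # hybrid proposal: mixture of the transition prior and an
    # observation-driven ("recognition") component
    from_prior = rng.random(N) < 0.5
    prop = np.where(from_prior,
                    particles + rng.normal(0, sig_x, N),   # prior component
                    y[t] + rng.normal(0, sig_q, N))        # recognition component
    q_pdf = 0.5 * gauss(prop, particles, sig_x) + 0.5 * gauss(prop, y[t], sig_q)
    w = gauss(y[t], prop, sig_y) * gauss(prop, particles, sig_x) / q_pdf
    w /= w.sum()
    est.append(np.sum(w * prop))
    particles = prop[rng.choice(N, N, p=w)]                # resampling

print("RMSE: %.3f" % np.sqrt(np.mean((np.array(est) - x_true) ** 2)))
```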

Yung-Kyun Noh, Ji Young Park, Byoung Geol Choi, Kee-Eung Kim, and Seung-Woon Rha: A Machine Learning-Based Approach for the Prediction of Acute Coronary Syndrome Requiring Revascularization. Journal of Medical Systems. 2019. [📄 Abstract] [✏️ Paper]

The aim of this study is to predict acute coronary syndrome (ACS) requiring revascularization in patients presenting with early-stage angina-like symptoms using machine learning algorithms. We obtained data from 2344 ACS patients who required revascularization and from 3538 non-ACS patients. We analyzed 20 features relevant to ACS using standard algorithms, support vector machines and linear discriminant analysis. Based on feature patterns and filter characteristics, we extracted a strong prediction function from the 20 selected features. The obtained prediction functions are relevant, showing an area under the curve of 0.860 for the prediction of ACS requiring revascularization. Some features are missing in many records even though they are considered very informative; it turned out that omitting those features from the input and training on more data without them improves the prediction accuracy. Additionally, from the investigation using receiver operating characteristic curves, a reliable prediction for 2.60% of non-ACS patients could be made with a specificity of 1.0. For those 2.60% of non-ACS patients, we can consider recommending medical treatment without risking misdiagnosis of patients requiring revascularization. We investigated a prediction algorithm to discriminate ACS patients requiring revascularization from non-ACS patients presenting angina-like symptoms at an early stage. In the future, a large cohort study is necessary to increase the prediction accuracy and confirm the possibility of safely discriminating the non-ACS patients from the ACS patients with confidence.

Youngsoo Jang, Jongmin Lee, and Kee-Eung Kim: Bayes-Adaptive Monte-Carlo Planning and Learning for Goal-Oriented Dialogues. Neural Information Processing Systems (NeurIPS) Conversational AI workshop. 2019. [📄 Abstract] [✏️ Paper]

We consider a strategic dialogue task, where the ability to infer the other agent's goal is critical to the success of the conversational agent. While this problem can be naturally formulated as Bayesian planning, it is known to be a very difficult problem due to its enormous search space consisting of all possible utterances. In this paper, we propose an efficient Bayes-adaptive planning algorithm for goal-oriented dialogues, which combines RNN-based dialogue generation and MCTS-based Bayesian planning in a novel way, leading to robust decision-making under the uncertainty of the other agent's goal. We then introduce reinforcement learning for the dialogue agent that uses MCTS as a strong policy improvement operator, casting reinforcement learning as iterative alternation of planning and supervised-learning of self-generated dialogues. In the experiments, we demonstrate that our Bayes-adaptive dialogue planning agent significantly outperforms the state-of-the-art in a negotiation dialogue domain. We also show that reinforcement learning via MCTS further improves end-task performance without diverging from human language.

Geon-Hyeong Kim, Youngsoo Jang, Jongmin Lee, Wonseok Jeon, Hongseok Yang, and Kee-Eung Kim: Trust Region Sequential Variational Inference. Proceedings of Asian Conference on Machine Learning (ACML). 2019. [📄 Abstract] [✏️ Paper]

Stochastic variational inference has emerged as an effective method for performing inference on or learning complex models for data. Yet, one of the challenges in stochastic variational inference is handling high-dimensional data, such as sequential data, and models with non-differentiable densities caused by, for instance, the use of discrete latent variables. In such cases, it is challenging to control the variance of the gradient estimator used in stochastic variational inference, while low variance is often one of the key properties needed for successful inference. In this work, we present a new algorithm for stochastic variational inference of sequential models which trades off bias for variance to tackle this challenge effectively. Our algorithm is inspired by variance reduction techniques in reinforcement learning, yet it uniquely adopts their key ideas in the context of stochastic variational inference. We demonstrate the effectiveness of our approach through formal analysis and experiments on synthetic and real-world datasets.

Youngsoo Jang*, Jongmin Lee*, Jaeyoung Park*, Kyeng-Hun Lee, Pierre Lison, and Kee-Eung Kim: PyOpenDial: A Python-based Domain-Independent Toolkit for Developing Spoken Dialogue Systems with Probabilistic Rules. Proceedings of Empirical Methods in Natural Language Processing (EMNLP) System Demonstrations. 2019. [📄 Abstract] [✏️ Paper]

We present PyOpenDial, a Python-based, domain-independent, open-source toolkit for spoken dialogue systems. Recent advances in core components of dialogue systems, such as speech recognition, language understanding, dialogue management, and language generation, harness deep learning to achieve state-of-the-art performance. The original OpenDial, implemented in Java, provides a plugin architecture to integrate external modules, but lacks Python bindings, making it difficult to interface with popular deep learning frameworks such as TensorFlow or PyTorch. To this end, we re-implemented OpenDial in Python and extended the toolkit with a number of novel functionalities for neural dialogue state tracking and action planning. We describe the overall architecture and its extensions, and illustrate their use on an example where the system response model is implemented with a recurrent neural network.

강민구 and 김기응: Learning a Controller for High-Speed Aerial Vehicles via Reinforcement Learning. KIMST Annual Conference. 2019. [📄 Abstract] [✏️ Paper]

Factors such as lift and drag that arise when a vehicle flies at very high speed induce strong nonlinearity in the system. In this work, we show that a data-driven, locally optimal flight controller can be learned even in environments with such nonlinearity. By applying a data-driven optimal control methodology (reinforcement learning), this work differentiates itself from traditional model-based control-theoretic methodologies. Although the results are at the proof-of-concept level, additional hyper-parameter tuning and more computing resources can be expected to further improve control performance.

Kanghoon Lee, Geon-Hyeong Kim, Pedro Ortega, Daniel D. Lee, and Kee-Eung Kim: Bayesian optimistic Kullback-Leibler exploration. Machine Learning Journal (MLJ), 108. 2019. [📄 Abstract] [🔗 Link]

We consider a Bayesian approach to model-based reinforcement learning, where the agent uses a distribution of environment models to find the action that optimally trades off exploration and exploitation. Unfortunately, it is intractable to find the Bayes-optimal solution to the problem except for restricted cases. In this paper, we present BOKLE, a simple algorithm that uses Kullback–Leibler divergence to constrain the set of plausible models for guiding the exploration. We provide a formal analysis that this algorithm is near Bayes-optimal with high probability. We also show an asymptotic relation between the solution pursued by BOKLE and a well-known algorithm called Bayesian exploration bonus. Finally, we show experimental results that clearly demonstrate the exploration efficiency of the algorithm.
2018

김건형, 장영수, 이종민, and 김기응: A Study on Extending Model-Based Bayesian Reinforcement Learning to Continuous Domains. Proceedings of the KICS Summer Conference. 2018. [📄 Abstract] [🔗 Link]

This paper studies how to extend model-based Bayesian reinforcement learning, which has previously been applied only to small, restricted domains, to more general continuous domains. To this end, we employ variational inference techniques within existing model-based Bayesian reinforcement learning to enable posterior updates in continuous domains. Ultimately, this allows model-based Bayesian reinforcement learning to be applied to continuous domains.

Wonseok Jeon, Seokin Seo, and Kee-Eung Kim: A Bayesian Approach to Generative Adversarial Imitation Learning. Advances in Neural Information Processing Systems (NeurIPS). 2018. Spotlight [📄 Abstract] [✏️ Paper]

Generative adversarial training for imitation learning has shown promising results on high-dimensional and continuous control tasks. This paradigm is based on reducing the imitation learning problem to the density matching problem, where the agent iteratively refines the policy to match the empirical state-action visitation frequency of the expert demonstration. Although this approach can robustly learn to imitate even with scarce demonstration, one must still address the inherent challenge that collecting trajectory samples in each iteration is a costly operation. To address this issue, we first propose a Bayesian formulation of generative adversarial imitation learning (GAIL), where the imitation policy and the cost function are represented as stochastic neural networks. Then, we show that we can significantly enhance the sample efficiency of GAIL leveraging the predictive density of the cost, on an extensive set of imitation learning tasks with high-dimensional states and actions.

Jongmin Lee, Geon-Hyeong Kim, Pascal Poupart, and Kee-Eung Kim: Monte-Carlo Tree Search for Constrained POMDPs. Advances in Neural Information Processing Systems (NeurIPS). 2018. [📄 Abstract] [✏️ Paper]

Monte-Carlo Tree Search (MCTS) has been successfully applied to very large POMDPs, a standard model for stochastic sequential decision-making problems. However, many real-world problems inherently have multiple goals, where multi-objective formulations are more natural. The constrained POMDP (CPOMDP) is such a model that maximizes the reward while constraining the cost, extending the standard POMDP model. To date, solution methods for CPOMDPs assume an explicit model of the environment, and thus are hardly applicable to large-scale real-world problems. In this paper, we present CC-POMCP (Cost-Constrained POMCP), an online MCTS algorithm for large CPOMDPs that leverages the optimization of LP-induced parameters and only requires a black-box simulator of the environment. In the experiments, we demonstrate that CC-POMCP converges to the optimal stochastic action selection in CPOMDP and pushes the state-of-the-art by being able to scale to very large problems.
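
The core cost-constrained scalarization can be conveyed on a toy constrained bandit: select actions by UCB on Q_r - lambda * Q_c and adapt the multiplier lambda by dual ascent toward the cost budget. This is only a schematic analogue of CC-POMCP's LP-induced parameter update, with a made-up three-action simulator in place of a black-box environment.

```python
import math, random

random.seed(0)
R = [1.0, 0.6, 0.2]          # expected rewards of three actions
C = [0.9, 0.4, 0.1]          # expected costs
BUDGET = 0.5                 # constraint: average cost <= 0.5

lam, eta = 0.0, 0.02
n, qr, qc = [0] * 3, [0.0] * 3, [0.0] * 3
avg_cost = 0.0
for t in range(1, 5001):
    # UCB over the lambda-scalarized value Q_r - lambda * Q_c
    ucb = [(qr[a] - lam * qc[a]) + (2 * math.log(t) / n[a]) ** 0.5 if n[a]
           else float("inf") for a in range(3)]
    a = max(range(3), key=lambda k: ucb[k])
    r = R[a] + random.gauss(0, 0.1)
    c = C[a] + random.gauss(0, 0.1)
    n[a] += 1
    qr[a] += (r - qr[a]) / n[a]
    qc[a] += (c - qc[a]) / n[a]
    avg_cost += (c - avg_cost) / t
    lam = max(0.0, lam + eta * (c - BUDGET))   # dual ascent on the multiplier

print("lambda %.2f, average cost %.2f (budget %.2f)" % (lam, avg_cost, BUDGET))
```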

Eun Sang Cha, Kee-Eung Kim, Stefano Longo, and Ankur Mehta: OP-CAS: Collision Avoidance with Overtaking Maneuvers. Proceedings of the IEEE Intelligent Transport Systems Conference (ITSC). 2018. [📄 Abstract] [✏️ Paper]

This paper presents a novel collision avoidance system for autonomous vehicles based on overtaking procedures. The proposed Overtaking Procedure for Collision Avoidance Systems (OP-CAS) takes a behavioral cloning-based approach that uses images obtained from a low-cost monocular camera. The algorithm selectively records the expert's corrective driving behavior during data collection: oscillatory driving behavior is recorded while the vehicle is returning to the center of the lane. This data augmentation method addresses the issue of covariate shift commonly found in behavioral cloning methods. The approach is computationally inexpensive, making it a viable option for real-time embedded deployment. A feasibility study was performed with two remotely controlled scaled vehicles as a proof of concept. Results showed that when two expert drivers demonstrated overtaking behaviors for data collection, even a small dataset was sufficient to model the overtaking sequence. The overtaking maneuvers were deployed in real time on 1/8th-scale RC platforms, validating OP-CAS for civilian vehicle safety applications.

MinKu Kang and Kee-Eung Kim: Simulated Physics for High Speed Aerial Systems. Proceedings of International Conference on Control, Automation and Systems (ICCAS). 2018. [📄 Abstract] [✏️ Paper]

In this work, we introduce a model of an aerial system based on a physics-based simulation engine. We investigate some basic properties of the proposed model, showing its potential benefit for autonomous control.

Jongmin Lee, Geon-Hyeong Kim, Pascal Poupart, and Kee-Eung Kim: Monte-Carlo Tree Search for Constrained MDPs. ICML/IJCAI/AAMAS Workshop on Planning and Learning (PAL). 2018. [📄 Abstract] [✏️ Paper]

Monte-Carlo Tree Search (MCTS) is the state-of-the-art online planning algorithm for very large MDPs. However, many real-world problems inherently have multiple goals, where multi-objective sequential decision models are more natural. The constrained MDP (CMDP) is such a model that maximizes the reward while constraining the cost. The common solution method for CMDPs is linear programming (LP), which is hardly applicable to large real-world problems. In this paper, we present CCUCT (Cost-Constrained UCT), an online planning algorithm for large constrained MDPs (CMDPs) that leverages the optimization of LP-induced parameters. We show that CCUCT converges to the optimal stochastic action selection in CMDPs and it is able to solve very large CMDPs through experiments on the multi-objective version of an Atari 2600 arcade game.

Youngsoo Jang, Jiyeon Ham, Byung-Jun Lee, and Kee-Eung Kim: Cross-language Neural Dialog State Tracker for Large Ontologies using Hierarchical Attention. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP). 2018. [📄 Abstract] [🔗 Link]

Dialog state tracking, which refers to identifying the user intent from utterances, is one of the most important tasks in dialog management. In this paper, we present our dialog state tracker developed for the Fifth Dialog State Tracking Challenge, which focused on cross-language adaptation using machine-translated training data that is very scarce compared to the size of the ontology. Our dialog state tracker is based on a bi-directional long short-term memory network with a hierarchical attention mechanism that spots important words in user utterances. The user intent is predicted by finding the keyword in the ontology closest to the attention-weighted word vector. With the suggested methodology, our tracker overcomes various difficulties caused by scarce training data that existing machine-learning-based trackers suffer from, such as predicting user intents they have not seen before. We show that our tracker outperforms the other trackers submitted to the challenge with respect to most of the performance measures.

Jiyeon Ham, Soohyun Lim, Kyeng-Hun Lee, and Kee-Eung Kim: Extensions to hybrid code networks for FAIR dialog dataset. Computer Speech and Language:12. 2018. [📄 Abstract] [🔗 Link]

Goal-oriented dialog systems require a different approach from chit-chat conversational systems in that they should perform various subtasks as well as continue the conversation itself. Since these systems typically interact with an external knowledge base that changes over time, it is desirable to incorporate domain knowledge to deal with such changes, yet with minimum human effort. This paper presents an extended version of the Hybrid Code Network (HCN) developed for the Facebook AI Research (FAIR) dialog dataset used in the Sixth Dialog System Technology Challenge (DSTC6). Compared to the original HCN, the system is more adaptable to changes in the knowledge base because its modules are extended to be learned from data. Using the proposed learning scheme with fairly elementary domain-specific rules, the proposed model achieved 100% accuracy on all test datasets.

Jang Won Bae, Junseok Lee, Do-Hyung Kim, Kanghoon Lee, Jongmin Lee, Kee-Eung Kim, and Il-Chul Moon: Layered Behavior Modeling via Combining Descriptive and Prescriptive Approaches: a Case Study of Infantry Company Engagement. IEEE Transactions on System, Man, and Cybernetics: Systems. 2018. [📄 Abstract] [🔗 Link]

Defense modeling and simulation (DM&S) has brought insights into how to efficiently operate combat entities, such as soldiers and weapon systems. Most DM&S works have been developed to reflect accurate descriptions of military doctrines, yet these doctrines provide only guidelines for military operations, not details about how combat entities should behave. Because such unspecified parts are in practice filled by the appropriate behavior of combat entities on the battlefield, it has been argued that DM&S should also model individual combat behaviors. However, discovering the best individual actions in an effectively infinite search space, such as a battlefield, is infeasible. This paper proposes layered behavior modeling to practically resolve this issue. The proposed method applies descriptive modeling to reduce the search space by employing domain-specific knowledge, and prescriptive modeling to discover the best individual actions in the reduced space. Both modeling methods are modularized and interact through an interface, defined in the proposed method, that is based on their semantic analogies. This paper presents a realization of the proposed method through a case study of infantry company-level operations, in which the descriptive part is implemented with the discrete event system specification formalism and the prescriptive part with a Markov decision process. The experimental results illustrate that the combat effectiveness resulting from the proposed method is statistically better than that from descriptive-only modeling, and that the difference is guided by the objective of the combat behavior. Through the presented experimental results and discussion, this paper argues that future DM&S should consider a broad spectrum of the battlefield, incorporating the rational behavior of military individuals.

Kee-Eung Kim and Hyun-Soo Park: Imitation Learning via Kernel Mean Embedding. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). 2018. [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

Imitation learning refers to the problem where an agent learns a policy that mimics the demonstration provided by the expert, without any information on the cost function of the environment. Classical approaches to imitation learning usually rely on a restrictive class of cost functions that best explains the expert's demonstration, exemplified by linear functions of pre-defined features on states and actions. We show that the kernelization of a classical algorithm naturally reduces the imitation learning to a distribution learning problem, where the imitation policy tries to match the state-action visitation distribution of the expert. Closely related to our approach is the recent work on leveraging generative adversarial networks (GANs) for imitation learning, but our reduction to distribution learning is much simpler, robust to scarce expert demonstration, and sample efficient. We demonstrate the effectiveness of our approach on a wide range of high-dimensional control tasks.
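
The distribution-matching reduction can be made concrete with a maximum mean discrepancy (MMD) between expert and imitator state-action samples, which is the empirical distance between their kernel mean embeddings. The sketch below uses random placeholder feature vectors; it illustrates the quantity being matched, not the paper's full algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(X, Y, bw=1.0):
    # Gram matrix of the RBF kernel between two sample sets
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def mmd2(X, Y, bw=1.0):
    # (biased) squared MMD between the two empirical distributions
    return rbf(X, X, bw).mean() + rbf(Y, Y, bw).mean() - 2 * rbf(X, Y, bw).mean()

expert = rng.normal(0.0, 1.0, (200, 4))    # placeholder expert (s, a) features
imitator = rng.normal(0.5, 1.0, (200, 4))  # placeholder imitator (s, a) features
print("squared MMD: %.4f" % mmd2(expert, imitator))
# An imitation learner of this flavor updates the policy so that this
# quantity is driven toward zero.
```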
2017

Jiyeon Ham, Soohyun Lim, and Kee-Eung Kim: Extended Hybrid Code Networks for DSTC6 FAIR Dialog Dataset. Dialog System Technology Challenges 6 Workshop. 2017. [📄 Abstract] [✏️ Paper]

Goal-oriented dialog systems require a different approach from chit-chat conversations in that they should perform various subtasks as well as carry on the dialog itself. Since the systems typically interact with an external database, it is efficient to incorporate simple domain knowledge in order to deal with changes in the external knowledge. This paper presents extended hybrid code networks for the sixth Dialog System Technology Challenge (DSTC6) Facebook AI Research (FAIR) dialog dataset. Compared to the original hybrid code networks (HCNs), we reduced the required hand-coded rules and added trainable submodules. Due to the additional learning components and reasonable domain-specific rules, the proposed model can be applied to more complex domains and achieved 100% accuracy on all test sets.

Yung-Kyun Noh, Masashi Sugiyama, Kee-Eung Kim, Frank Park, and Daniel Lee: Generative Local Metric Learning for Kernel Regression. Advances in Neural Information Processing Systems (NIPS). 2017. [📄 Abstract] [✏️ Paper]

This paper shows how metric learning can be used with Nadaraya-Watson (NW) kernel regression. Compared with standard approaches such as bandwidth selection, we show how metric learning can significantly reduce the mean square error (MSE) in kernel regression, particularly for high-dimensional data. We propose a method for efficiently learning a good metric function based upon analyzing the performance of the NW estimator for Gaussian-distributed data. A key feature of our approach is that the NW estimator with a learned metric uses information from both the global and local structure of the training data. Theoretical and empirical results confirm that the learned metric can considerably reduce the bias and MSE for kernel regression.
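
A compact sketch of NW regression under a Mahalanobis metric is given below; the metric matrix is fixed by hand purely for illustration (the paper learns it from a generative analysis of the data), and the toy data are invented so that only the first input dimension matters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.sin(2 * X[:, 0]) + 0.05 * rng.normal(size=300)   # dim 1 is pure noise

def nw_predict(Xq, X, y, A, bw=0.5):
    # Nadaraya-Watson estimator with kernel on Mahalanobis distances d^T A d
    diff = Xq[:, None, :] - X[None, :, :]                # (q, n, d)
    d2 = np.einsum("qnd,de,qne->qn", diff, A, diff)
    K = np.exp(-d2 / (2 * bw ** 2))
    return (K @ y) / K.sum(axis=1)

Xq = rng.normal(size=(200, 2))
y_true = np.sin(2 * Xq[:, 0])
for name, A in [("Euclidean metric", np.eye(2)),
                ("metric stressing the relevant dim", np.diag([4.0, 0.25]))]:
    mse = np.mean((nw_predict(Xq, X, y, A) - y_true) ** 2)
    print("%s: MSE %.4f" % (name, mse))
```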

Jang Won Bae, Bowon Nam, Kee-Eung Kim, Junseok Lee, and Il-Chul Moon: Hybrid Modeling and Simulation of Tactical Maneuvers in Computer Generated Force. Proceedings of the IEEE Conference on System, Man, and Cybernetics (SMC). 2017. [📄 Abstract] [✏️ Paper]

Defense modeling and simulation (DM&S) offers insights into the efficient operation of combat entities, e.g., soldiers and weapon systems. Most DM&S works aim at an exact description of military doctrines, but the doctrines often fail to provide detailed action procedures for how combat entities conduct military operations. Such unspecified parts are filled by the rational behaviors of combat entities on the battlefield, and the resulting combat effectiveness differs accordingly; incorporating such rational factors can thus provide insights that cannot be captured by traditional works. To examine this postulation, this paper develops a computer generated force where the tactical maneuvers of combat entities are realized by a combination of descriptive and prescriptive modeling. Specifically, the descriptive models describe the explicit action rules in military doctrines and are modeled using the discrete event system specification (DEVS) formalism; the prescriptive models capture the rational behavior of combat entities under the military doctrines and are modeled using a partially observable Markov decision process (POMDP). The results illustrate that the proposed approach helps to maintain a team formation effectively, and that this formation maintenance leads to better combat efficiency.

Jongmin Lee, Youngsoo Jang, Pascal Poupart, and Kee-Eung Kim: Constrained Bayesian Reinforcement Learning via Approximate Linear Programming. ECML-PKDD Workshop on Scaling-Up Reinforcement Learning (SURL). 2017. [📄 Abstract] [✏️ Paper]

In this paper, we highlight our recent work (Lee et al., 2017) considering the safe learning scenario where we need to restrict the exploratory behavior of a reinforcement learning agent. Specifically, we treat the problem as a form of Bayesian reinforcement learning (BRL) in an environment that is modeled as a constrained MDP (CMDP) where the cost function penalizes undesirable situations. We propose a model-based BRL algorithm for such an environment, eliciting risk-sensitive exploration in a principled way. Our algorithm efficiently solves the constrained BRL problem by approximate linear programming, and generates a finite state controller in an off-line manner. We provide theoretical guarantees and demonstrate empirically that our approach outperforms the state of the art.

이종민, 홍정표, 박재영, 이강훈, 김기응, 문일철, and 박재현: A Case Study of POMDP Behavior Planning and Learning for Large-Scale Virtual Forces in Counter-Fire Warfare and Mechanized Infantry Scenarios. KIISE Transactions on Computing Practices, 23(6):343-349. 2017. [📄 Abstract] [✏️ Paper]

In combat modeling and simulation of large-scale virtual forces, describing the behavior of rational combat entities that act autonomously is a key element for refining the operations of future battles and enabling efficient simulated training. The DEVS-POMDP hierarchical framework simulates large-scale virtual forces by modeling high-level decision-making that follows combat doctrine with DEVS and low-level autonomous behavior planning, which is hard to specify concretely, with POMDPs; however, it has the drawback that computing the optimal POMDP policy requires substantial computing resources. In this paper, through case studies of a simulated counter-fire warfare scenario and a simulated mechanized infantry brigade offensive operation scenario modeled with DEVS-POMDP, we propose an efficient POMDP tree search algorithm and confirm that learning a model of enemy behavior patterns improves the performance of virtual combat entities.

Jongmin Lee, Youngsoo Jang, Pascal Poupart, and Kee-Eung Kim: Constrained Bayesian Reinforcement Learning via Approximate Linear Programming. Proceedings of International Joint Conference on Artificial Intelligence (IJCAI). 2017. [📄 Abstract] [✏️ Paper]

In this paper, we consider the safe learning scenario where we need to restrict the exploratory behavior of a reinforcement learning agent. Specifically, we treat the problem as a form of Bayesian reinforcement learning in an environment that is modeled as a constrained MDP (CMDP) where the cost function penalizes undesirable situations. We propose a model-based Bayesian reinforcement learning (BRL) algorithm for such an environment, eliciting risk-sensitive exploration in a principled way. Our algorithm efficiently solves the constrained BRL problem by approximate linear programming, and generates a finite state controller in an offline manner. We provide theoretical guarantees and demonstrate empirically that our approach outperforms the state of the art.

Byung-Jun Lee, Jongmin Lee, and Kee-Eung Kim: Hierarchically-partitioned Gaussian Process Approximation. Proceedings of Artificial Intelligence and Statistics (AISTATS). 2017. [📄 Abstract] [✏️ Paper]

The Gaussian process (GP) is a simple yet powerful probabilistic framework for various machine learning tasks. However, exact algorithms for learning and prediction are prohibitively expensive to apply to large datasets due to their inherent computational complexity. To overcome this main limitation, various techniques have been proposed, in particular local GP algorithms that scale "truly linearly" with respect to the dataset size. In this paper, we introduce a hierarchical model based on local GPs for large-scale datasets, which stacks inducing points over inducing points in layers. By using different kernels in each layer, the overall model becomes multi-scale and is able to capture both long- and short-range dependencies. We demonstrate the effectiveness of our model by its speed-accuracy performance on challenging real-world datasets.
2016

Youngsoo Jang, Jiyeon Ham, Byung-Jun Lee, Youngjae Chang, and Kee-Eung Kim: Neural Dialog State Tracker for Large Ontologies by Attention Mechanism. IEEE Workshop on Spoken Language Technology. 2016. [📄 Abstract] [✏️ Paper]

This paper presents in detail a dialog state tracker submitted to the Fifth Dialog State Tracking Challenge (DSTC 5). To tackle the challenging cross-language human-human dialog state tracking task with limited training data, we propose a tracker that focuses on words with meaningful context based on an attention mechanism and bi-directional long short-term memory (LSTM). The vocabulary, which includes many proper nouns, is vectorized with a sufficient amount of related text crawled from the web to learn good embeddings for words not present in the training dialogs. Despite its simplicity, our proposed tracker achieves high accuracy without sophisticated pre- and post-processing.

Daehyun Lee, Jongmin Lee, and Kee-Eung Kim: Multi-View Automatic Lip-Reading using Neural Network. ACCV 2016 Workshop on Multi-view Lip-reading Challenges. 2016. [📄 Abstract] [✏️ Paper]

It is well known that automatic lip-reading (ALR), also known as visual speech recognition (VSR), enhances the performance of speech recognition in noisy environments and also has applications of its own. However, ALR is a challenging task due to the variety of lip shapes and the ambiguity of visemes (the basic units of visual speech information). In this paper, we tackle ALR as a classification task using an end-to-end neural network based on convolutional neural network and long short-term memory architectures. We conduct single-, cross-, and multi-view experiments in a speaker-independent setting with various network configurations for integrating the multi-view data. We achieve 77.9%, 83.8%, and 78.6% classification accuracy on average for the single-, cross-, and multi-view settings, respectively. This result is better than the best score (76%) of the preliminary single-view results released by the ACCV 2016 workshop on multi-view lip-reading/audiovisual challenges. It also shows that additional view information helps to improve the performance of ALR with a neural network architecture.

홍정표, 이종민, 이강훈, 한상규, 김기응, 문일철, and 박재현: A Case Study of POMDP Behavior Planning and Learning for Large-Scale Virtual Forces. Proceedings of the KIISE Summer Conference. 2016. [📄 Abstract] [✏️ Paper]

Combat modeling and simulation of large-scale virtual forces refines the operations of future battles and enables efficient simulated training. To this end, the DEVS-POMDP hierarchical framework simulates the autonomous behavior of virtual forces by modeling combat doctrine and the corresponding concrete behavior plans with DEVS and POMDPs, respectively; however, computing the optimal policy of the POMDP model still requires substantial computing resources. In this paper, through a case study of a simulated counter-fire warfare scenario on Yeonpyeong Island modeled with DEVS-POMDP, we confirm that an efficient POMDP tree search algorithm and learning a model of enemy behavior patterns improve the performance of virtual combat entities.

홍택규, 김건형, 이병준, and 김기응: A Probabilistic Approach to the Interceptor Weapon Allocation Problem Using Multi-Armed Bandits. Proceedings of the KIISE Summer Conference. 2016. [📄 Abstract] [✏️ Paper]

This paper addresses the problem of deciding how many interceptors to launch when the enemy has fired a weapon toward a friendly base. Previous studies on the interceptor weapon allocation problem made the unrealistic assumption that the interception success probability of the interceptors is known. In an actual war, however, the interception success probability may differ from its previously assumed value depending on the situation, so a more realistic study should proceed under the assumption that this probability is unknown. This paper therefore models the interceptor weapon allocation problem as a multi-armed bandit problem under the assumption that the interception success probability is unknown, and presents a method for solving it.

Teakgyu Hong, Jongmin Lee, Kee-Eung Kim, Pedro A. Ortega, and Daniel Lee: Bayesian Reinforcement Learning with Behavioral Feedback. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1571-1577. 2016. [📄 Abstract] [🔗 Link]

In the standard reinforcement learning setting, the agent learns an optimal policy solely from state transitions and rewards from the environment. We consider an extended setting where a trainer additionally provides feedback on the actions executed by the agent. This requires appropriately incorporating the feedback, even when it is not necessarily accurate. In this paper, we present a Bayesian approach to this extended reinforcement learning setting. Specifically, we extend Kalman Temporal Difference learning to compute the posterior distribution over Q-values given the state transitions and rewards from the environment as well as the feedback from the trainer. Through experiments on standard reinforcement learning tasks, we show that learning performance can be significantly improved even with inaccurate feedback.

Byung-Jun Lee and Kee-Eung Kim: Dialog History Construction with Long-Short Term Memory for Robust Generative Dialog State Tracking. Dialogue & Discourse 7(3). 2016. [📄 Abstract] [✏️ Paper]

One of the crucial components of a dialog system is the dialog state tracker, which infers the user's intention from preliminary speech processing. Since the overall performance of the dialog system is heavily affected by that of the dialog state tracker, it has been one of the core areas of research on dialog systems. In this paper, we present a dialog state tracker that combines a generative probabilistic model of dialog state tracking with a recurrent neural network that encodes important aspects of the dialog history. We describe a two-step gradient descent algorithm that optimizes the tracker with a complex loss function. We demonstrate that this approach yields a dialog state tracker that performs competitively with the top-performing trackers that participated in the first and second Dialog State Tracking Challenges.

Yeganeh Mashayekh Hayeri, Kee-Eung Kim, and Daniel D. Lee: An Inverse Reinforcement Learning Approach to Car Following Behaviors. TRB 95th Annual Meeting Compendium of Papers, Transportation Research Board. 2016. [📄 Abstract] [🔗 Link]

In this study we provide new insights into classic car-following theories by learning drivers' behavioral preferences. We model car-following behavior using decision-theoretic techniques. We assume the driver is a decision maker acting based on a utility function that assigns a degree of desirability to the driving situation. Our method uses the inverse problem of control theory, known as inverse reinforcement learning in the more modern terminology of machine learning. We use a publicly available dataset on car-following behavior known as the Bosch dataset, which includes headway distance, speed and acceleration data. Our simulations recover the reward function that makes the actual driving behavior in the data preferable to any other behavior. Understanding such behaviors and preferences is becoming crucial as we enter the modern era of transportation automation. Considering drivers' preferences while designing automation features would improve the safety and efficiency of the driving environment while ensuring a desirable and comfortable setting for those inside the vehicles.
2015

홍택규, 이병준, 김건형, and 김기응: An Effective Approach to the Sequential Weapon Allocation Problem via Hierarchical Modeling. Proceedings of the KIISE Winter Conference. 2015. [📄 Abstract] [🔗 Link]

The weapon-allocation problem is the problem of effectively allocating friendly interceptor weapons against sporadic enemy attacks. This paper deals with the sequential weapon allocation problem, in which weapons are fired at friendly forces over multiple time steps. In this problem, the number of states explodes with the number of enemy weapons, the number of friendly assets, and the number of friendly interceptors, and classical algorithms then become unable to solve the problem. This paper presents a method to solve the sequential weapon allocation problem effectively through hierarchical modeling built on existing algorithms.

Pedro Ortega, Kee-Eung Kim, and Daniel Lee: Reactive bandits with attitude. Proceedings of Artificial Intelligence and Statistics (AISTATS). 2015. [📄 Abstract] [✏️ Paper]

We consider a general class of K-armed bandits that adapt to the actions of the player. A single continuous parameter characterizes the "attitude" of the bandit, ranging from stochastic to cooperative or to fully adversarial in nature. The player seeks to maximize the expected return from the adaptive bandit, and the associated optimization problem is related to the free energy of a statistical mechanical system under an external field. When the underlying stochastic distribution is Gaussian, we derive an analytic solution for the long run optimal player strategy for different regimes of the bandit. In the fully adversarial limit, this solution is equivalent to the Nash equilibrium of a two-player, zero-sum semi-infinite game. We show how optimal strategies can be learned from sequential draws and reward observations in these adaptive bandits using Bayesian filtering and Thompson sampling. Results show the qualitative difference in policy regret between our proposed strategy and other well-known bandit algorithms.
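
For the stationary (purely stochastic) limit of such bandits, the Thompson sampling component mentioned in the abstract reduces to the familiar Gaussian-posterior sampler sketched below; the adaptive, attitude-dependent dynamics of the paper's bandits are not modeled here, and the arm means are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.4])
noise = 1.0                    # known reward noise scale

mu = np.zeros(3)               # posterior means (prior N(0, 1) per arm)
prec = np.ones(3)              # posterior precisions
for t in range(2000):
    theta = rng.normal(mu, 1 / np.sqrt(prec))   # one posterior draw per arm
    a = int(np.argmax(theta))                   # play the sampled best arm
    r = true_means[a] + rng.normal(0, noise)
    # conjugate Gaussian posterior update for the played arm
    prec[a] += 1 / noise ** 2
    mu[a] += (r - mu[a]) / (noise ** 2 * prec[a])

print("posterior means:", mu.round(3), "(true:", true_means, ")")
```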

Jaedeug Choi and Kee-Eung Kim: Hierarchical Bayesian Inverse Reinforcement Learning. IEEE Transactions on Cybernetics, 45(4). 2015. [📄 Abstract] [✏️ Paper]

Inverse reinforcement learning (IRL) is the problem of inferring the underlying reward function from the expert’s behavior data. The difficulty in IRL mainly arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behavior data as optimal. Another difficulty comes from the noisy behavior data due to sub-optimal experts. We propose a hierarchical Bayesian framework, which subsumes most of the previous IRL algorithms as well as models the sub-optimality of the expert’s behavior. Using a number of experiments on a synthetic problem, we demonstrate the effectiveness of our approach including the robustness of our hierarchical Bayesian framework to the sub-optimal expert behavior data. Using a real dataset from taxi GPS traces, we additionally show that our approach predicts the driving behavior with a high accuracy.

Pascal Poupart, Aarti Malhotra, Pei Pei, Kee-Eung Kim, Bongseok Goh, and Michael Bowling: Approximate Linear Programming for Constrained Partially Observable Markov Decision Processes. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). 2015. [📄 Abstract] [✏️ Paper]

In many situations, it is desirable to optimize a primary objective while respecting some constraints with respect to secondary objectives. In this work, we describe a technique based on approximate linear programming to optimize policies in constrained partially observable Markov decision processes. The optimization is performed offline and produces a finite state controller with desirable performance guarantees. The approach performs favorably in comparison to a constrained version of point-based value iteration on a suite of benchmark problems.

Hyeoneun Kim, Woosang Lim, Kanghoon Lee, Yung-Kyun Noh, and Kee-Eung Kim: Reward Shaping for Model-Based Bayesian Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). 2015. [📄 Abstract] [✏️ Paper]

Bayesian reinforcement learning (BRL) provides a formal framework for optimal exploration-exploitation tradeoff in reinforcement learning. Unfortunately, it is generally intractable to find the Bayes-optimal behavior except for restricted cases. As a consequence, many BRL algorithms, model-based approaches in particular, rely on approximated models or real-time search methods. In this paper, we present potential-based shaping for improving the learning performance in model-based BRL. We propose a number of potential functions that are particularly well suited for BRL, and are domain-independent in the sense that they do not require any prior knowledge about the actual environment. By incorporating the potential function into real-time heuristic search, we show that we can significantly improve the learning performance in standard benchmark domains.
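
Potential-based shaping itself is a one-liner, reproduced below with a made-up distance-to-goal potential; the paper's contribution lies in choosing BRL-specific, domain-independent potentials and integrating them into real-time heuristic search, which this sketch does not attempt.

```python
# The shaped reward r + gamma * phi(s') - phi(s) provably preserves the
# optimal policy for any potential function phi.
def shaped_reward(r, s, s_next, phi, gamma=0.95):
    return r + gamma * phi(s_next) - phi(s)

# Example: a distance-to-goal potential on a 1-D chain with goal state 10.
phi = lambda s: -abs(10 - s)
# 0 + 0.95 * (-6) - (-7) = 1.3 (up to floating point)
print(shaped_reward(0.0, s=3, s_next=4, phi=phi))
```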

Kanghoon Lee and Kee-Eung Kim: Tighter Value Function Bounds for Bayesian Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). 2015. [📄 Abstract] [✏️ Paper]

Bayesian reinforcement learning (BRL) provides a principled framework for the optimal exploration-exploitation tradeoff in reinforcement learning. We focus on model-based BRL, which involves a compact formulation of the optimal tradeoff from the Bayesian perspective. However, it still remains a computational challenge to compute the Bayes-optimal policy. In this paper, we propose a novel approach to compute tighter bounds on the Bayes-optimal value function, which is crucial for improving the performance of many model-based BRL algorithms. We then present how our bounds can be integrated into real-time AO* heuristic search, and provide a theoretical analysis of the impact of improved bounds on search efficiency. We also provide empirical results on standard BRL domains that demonstrate the effectiveness of our approach.
2014

Byung-Jun Lee, Woosang Lim, and Kee-Eung Kim: Optimizing Generative Dialog State Tracker via Cascading Gradient Descent. Proceedings of the SIGDIAL, pp. 273-281. 2014. [📄 Abstract] [✏️ Paper]

For robust spoken dialog management, various dialog state tracking methods have been proposed. Although discriminative models are gaining popularity due to their superior performance, generative models based on the Partially Observable Markov Decision Process model still remain attractive since they provide an integrated framework for dialog state tracking and dialog policy optimization. Although a straightforward way to fit a generative model is to independently train the component probability models, we present a gradient descent algorithm that simultaneously trains all the component models. We show that the resulting tracker performs competitively with other top-performing trackers that participated in DSTC2.

Hyeoneun Kim, Bongseok Goh, Bowon Nam, Kanghoon Lee, Jeong Hee Hong, Il Chul Moon, and Kee-Eung Kim: Multi-Level Hybrid Behavior Model of Computer Generated Forces. Proceedings of the AAMAS Workshop on Agents, Virtual Societies and Analytics. 2014. [📄 Abstract] [✏️ Paper]

Computer Generated Forces (CGFs) refer to simulation models of combat entities. While the holy grail of CGFs is a realistic reflection of those entities, this is difficult to achieve since the model is often too sophisticated to be replicated. Traditional models that translate field manuals into descriptive models generally produce reliable behaviors, but concerns remain about brittleness in undescribed or unexpected situations. In this respect, automated planning approaches can produce robust behaviors for dynamic situations, but the computational resources required to compute full-scale solutions are too demanding. This paper proposes a multi-level behavior modeling approach that adopts the knowledge-engineering approach to describe high-level tactical behavior rules and the automated planning approach to compute low-level combat actions in dynamic combat situations. We show that this two-level approach ensures reliable behaviors with moderate computation time.

홍택규, 고봉석, and 김기응: An Adaptive Soft Keyboard via Key-Sequence Prediction: A Case Study on the Android Platform. Proceedings of the Korea Computer Congress (KCC), pp. 1767-1769. 2014. [📄 Abstract] [✏️ Paper]

Soft keyboards are used in most modern smartphones thanks to their software controllability and the larger usable screen area compared to physical keyboards. Despite these advantages, the small screen of a smartphone causes users to make many typing errors. To address this problem, Gunawardana et al. of Microsoft Research proposed an adaptive soft keyboard based on key-sequence prediction: given the touched keyboard position and the characters entered so far, a probabilistic model over the next characters is built and used to adjust the touch-sensitive area of each key. This paper presents a case study applying the key-sequence-prediction-based adaptive soft keyboard on the Android platform, including an application to a Korean keyboard.
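
A toy version of the underlying decision rule might look as follows: score each key by a Gaussian touch likelihood around its center times a language-model prior given the typed history, which implicitly reshapes each key's effective hit region. The layout, noise scale, and "language model" below are all invented for illustration.

```python
import numpy as np

keys = {"q": (0.0, 0.0), "w": (1.0, 0.0), "e": (2.0, 0.0)}
sigma = 0.6                                  # touch noise scale

def lm_prior(history):
    # stand-in for a real language model: after "q", "w" is unlikely
    return {"q": 0.2, "w": 0.1, "e": 0.7} if history.endswith("q") \
        else {"q": 1 / 3, "w": 1 / 3, "e": 1 / 3}

def decode(touch, history):
    # posterior over keys: p(touch | key) * p(key | history), normalized
    prior = lm_prior(history)
    score = {k: prior[k] * np.exp(-np.sum((np.array(touch) - np.array(c)) ** 2)
                                  / (2 * sigma ** 2))
             for k, c in keys.items()}
    z = sum(score.values())
    return {k: v / z for k, v in score.items()}

# An ambiguous touch halfway between "w" and "e" resolves to "e" after a "q":
print(decode((1.5, 0.0), history="q"))
```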
2013

배장원, 이강훈, 김현은, 이준석, 고봉석, 남보원, 문일철, 김기응, and 박재현: Combat Entity Modeling Using POMDP-DEVS. Journal of the Korean Institute of Industrial Engineers, 39(6):498-516. 2013. [📄 Abstract] [✏️ Paper]

Combat Modeling and Simulation (M&S) is significant for decision makers who predict the course of future wars. Classical methodologies for combat M&S aimed to describe the exact behaviors of combat entities from military doctrines, yet they were limited in describing reasonable behaviors of combat entities that do not appear in the doctrines. Hence, this paper proposes a synthesizing modeling methodology for combat entity models that considers both 1) exact behaviors via descriptive modeling and 2) reasonable behaviors via prescriptive modeling. With the proposed methodology, combat entities can represent combat actions more realistically than with the classical methodologies. Moreover, the experimental results obtained with the proposed methodology were significantly different from those obtained with the classical methodologies. Through analyses of the experimental results, we show that the reasonable behaviors of combat entities, which are not specified in the doctrines, should be considered in combat M&S.

임희진, 최재득, 석재현, and 김기응: Bayesian Collaborative Competitive Filtering for Recommender Systems. Proceedings of the Korea Computer Congress (KCC), pp. 1496-1498. 2013. [📄 Abstract] [✏️ Paper]

A recommender system involves interaction between the recommendations provided by the system and the users' responses to them. Collaborative competitive filtering (CCF), a choice-based recommendation approach that uses this interaction process to train the recommendation model, was recently proposed. However, this algorithm requires a computationally expensive tuning process for the regularization parameters. In this paper, we apply Bayesian techniques to collaborative competitive filtering and propose Bayesian collaborative competitive filtering, which requires no such parameter tuning. We also introduce a Markov chain Monte Carlo algorithm for effective inference in the model, and experiments on a large-scale dataset confirm that Bayesian CCF outperforms CCF.

Daejoong Kim, Jaedeug Choi, Kee-Eung Kim, Jungsu Lee, and Jinho Sohn: Engineering Statistical Dialog State Trackers: A Case Study on DSTC. Department of Computer Science, KAIST, Technical Report(CS-TR-2013-379). 2013. [📄 Abstract] [✏️ Paper]

We describe our experience with engineering the dialog state tracker for the first Dialog State Tracking Challenge (DSTC). Dialog trackers are one of the essential components of dialog systems which are used to infer the true user goal from the speech processing results. We explain the main parts of our tracker: the observation model, the belief refinement model, and the belief transformation model. We also report experimental results on a number of approaches to the models, and compare the overall performance of our tracker to other submitted trackers. This technical report is a companion to the shortened version presented at SIGDIAL 2013.

Daejoong Kim, Jaedeug Choi, Kee-Eung Kim, Jungsu Lee, and Jinho Sohn: Engineering Statistical Dialog State Trackers: A Case Study on DSTC. Proceedings of the SIGDIAL 2013 Conference, pp. 462-466. 2013. [📄 Abstract] [✏️ Paper]

We describe our experience with engineering the dialog state tracker for the first Dialog State Tracking Challenge (DSTC). Dialog trackers are one of the essential components of dialog systems which are used to infer the true user goal from the speech processing results. We explain the main parts of our tracker: the observation model, the belief refinement model, and the belief transformation model. We also report experimental results on a number of approaches to the models, and compare the overall performance of our tracker to other submitted trackers. An extended version of this paper is available as a technical report (Kim et al., 2013).

Jaedeug Choi and Kee-Eung Kim: Bayesian Nonparametric Feature Construction for Inverse Reinforcement Learning. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 2013. [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

Most of the algorithms for inverse reinforcement learning (IRL) assume that the reward function is a linear function of the pre-defined state and action features. However, it is often difficult to manually specify the set of features that can make the true reward function representable as a linear function. We propose a Bayesian nonparametric approach to identifying useful composite features for learning the reward function. The composite features are assumed to be the logical conjunctions of the predefined atomic features so that we can represent the reward function as a linear function of the composite features. We empirically show that our approach is able to learn composite features that capture important aspects of the reward function on synthetic domains, and predict taxi drivers' behaviour with high accuracy on a real GPS trace dataset.
2012

Jaedeug Choi and Kee-Eung Kim: Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions. Advances in Neural Information Processing Systems (NIPS). 2012. [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains.
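
The gradient-guided posterior sampling the abstract mentions is in the spirit of the Langevin-adjusted Metropolis-Hastings step sketched below. The log posterior here is a stand-in Gaussian; in Bayesian IRL it would score a reward vector against the demonstration data (requiring an MDP solve), which is elided.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps = 5, 0.3

def log_post(r):                 # placeholder log posterior over rewards
    return -0.5 * np.sum((r - 1.0) ** 2)

def grad_log_post(r):
    return -(r - 1.0)

def mala_step(r):
    # Langevin proposal: drift along the posterior gradient plus noise
    mean_fwd = r + 0.5 * eps ** 2 * grad_log_post(r)
    prop = mean_fwd + eps * rng.normal(size=d)
    mean_bwd = prop + 0.5 * eps ** 2 * grad_log_post(prop)
    # Metropolis-Hastings correction for the asymmetric proposal
    log_q_fwd = -np.sum((prop - mean_fwd) ** 2) / (2 * eps ** 2)
    log_q_bwd = -np.sum((r - mean_bwd) ** 2) / (2 * eps ** 2)
    log_accept = log_post(prop) - log_post(r) + log_q_bwd - log_q_fwd
    return prop if np.log(rng.random()) < log_accept else r

r = np.zeros(d)
samples = []
for _ in range(3000):
    r = mala_step(r)
    samples.append(r.copy())
print("posterior mean estimate:", np.mean(samples[500:], axis=0).round(2))
```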

Dongho Kim, Kee-Eung Kim, and Pascal Poupart: Cost-Sensitive Exploration in Bayesian Reinforcement Learning. Advances in Neural Information Processing Systems (NIPS). 2012. [📄 Abstract] [✏️ Paper]

In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected long term total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems.

Kanghoon Lee, Heejin Lim, and Kee-Eung Kim: A POMDP Approach to Optimizing P300 Speller BCI Paradigm. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 20(4). 2012. [📄 Abstract] [✏️ Paper] [🔗 Link]

To achieve high performance in brain-computer interfaces (BCIs) using P300, most of the work has been focused on feature extraction and classification algorithms. Although significant progress has been made in such signal processing methods in the lower layer, the issues in the higher layer, specifically determining the stimulus schedule in order to identify the target reliably and efficiently, remain relatively unexplored. In this paper, we propose a systematic approach to compute an optimal stimulus schedule in P300 BCIs. Our approach adopts the partially observable Markov decision process, which is a model for planning in partially observable stochastic environments. We show that the stimulus schedule thus obtained achieves a significant performance improvement in terms of the success rate, bit rate, and practical bit rate through human subject experiments.
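
The core of such a stimulus scheduler is a belief over which item the user is attending to, updated after each flash from the EEG classifier output. A minimal sketch, with invented likelihoods and a six-item display:

```python
import numpy as np

# Sketch of the belief machinery behind POMDP stimulus scheduling: maintain a
# belief over the target item and update it from the P300 classifier score
# after each flash. The hit/false-alarm rates below are invented.

def update(belief, flashed, score, p_hit=0.8, p_false=0.2):
    # Likelihood of the classifier output for flashed vs. non-flashed items.
    like = np.where(flashed,
                    p_hit if score > 0 else 1 - p_hit,
                    p_false if score > 0 else 1 - p_false)
    post = like * belief
    return post / post.sum()

belief = np.full(6, 1 / 6)                    # six items on the screen
flashed = np.array([1, 1, 1, 0, 0, 0], bool)  # flash the first group
belief = update(belief, flashed, score=+1.0)  # classifier said "P300 seen"
print(belief)
```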

Kanghoon Lee, Heejin Lim, and Kee-Eung Kim: Factored POMDP를 이용한 가상군의 자율행위 모델링 사례연구 (A Case Study on Modeling the Autonomous Behavior of Computer-Generated Forces Using Factored POMDPs). 한국컴퓨터종합학술대회 논문집 (Proceedings of Korea Computer Congress), vol. 39(1B). 2012. [📄 Abstract] [✏️ Paper]

Modeling the autonomous behavior of computer-generated forces (CGFs) is a key factor that determines the performance of battlefield simulation systems. The partially observable Markov decision process (POMDP), which enables optimal decision making by probabilistically accounting for uncertain situations, is a very natural framework for modeling the autonomous behavior of CGFs. However, the difficulty of computing optimal policies, caused by the high computational complexity of POMDPs, hinders their use for this purpose. In this paper, we employ the factored POMDP model to scale autonomous-behavior modeling to large CGFs, and we confirm its effectiveness through a "Hasty Defense" case study.

Byung Kon Kang and Kee-Eung Kim: Exploiting Symmetries for Single and Multi-Agent Partially Observable Stochastic Domains. Artificial Intelligence, 182-183:32-57. 2012. [📄 Abstract] [✏️ Paper]

While Partially Observable Markov Decision Processes (POMDPs) and their multi-agent extension, Partially Observable Stochastic Games (POSGs), provide a natural and systematic approach to modeling sequential decision-making problems under uncertainty, the computational cost of computing their solutions is known to be prohibitive. In this paper, we show how these high computational resource requirements can be alleviated through the use of symmetries present in the problem. The problem of finding the symmetries can be cast as a graph automorphism (GA) problem on a graphical representation of the problem. We demonstrate how such symmetries can be exploited in order to speed up the solution computation and provide computational complexity results.
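
As a small illustration of the reduction, the snippet below encodes a toy model as a graph and enumerates its automorphisms with a generic matcher (networkx); the paper constructs the graph from the POMDP/POSG dynamics and uses a dedicated GA solver.

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Sketch of the GA reduction: encode a (toy) model as a graph and enumerate
# its self-isomorphisms, i.e., automorphisms. The 4-cycle stands in for the
# graph the paper builds from the problem dynamics.

G = nx.cycle_graph(4)
gm = isomorphism.GraphMatcher(G, G)
automorphisms = list(gm.isomorphisms_iter())
print(len(automorphisms))   # the 4-cycle has 8 symmetries (dihedral group D4)
```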

Dongho Kim, Jaesong Lee, Jaedeug Choi, and Kee-Eung Kim: 복수 무인기를 위한 POMDP 기반 동적 임무 할당 및 정찰 임무 최적화 기법 (A POMDP-Based Method for Dynamic Task Allocation and Reconnaissance Mission Optimization for Multiple UAVs). 정보과학회 논문지: 소프트웨어 및 응용 (Journal of KIISE: Software and Applications), 39(6). 2012. [📄 Abstract] [✏️ Paper]

With recent advances in unmanned aerial vehicle (UAV) manufacturing technology, there have been various attempts to use multiple UAVs not only for civilian purposes such as agriculture and disaster monitoring, but also for military purposes such as reconnaissance and attack. However, since it is difficult for human operators to directly control each UAV when many are deployed, it is essential to develop algorithms with which the UAVs autonomously cooperate and act effectively to achieve a given goal. This can be viewed as a sequential decision-making problem, and representative decision-theoretic models such as Markov decision processes (MDPs) and their extension to partial or inaccurate observations, partially observable MDPs (POMDPs), allow decision-making problems in complex and uncertain environments to be handled statistically. In this paper, we show that dynamic task allocation and reconnaissance missions with multiple UAVs can be efficiently optimized using POMDPs, and that when sensor observations are subject to error, POMDPs achieve better performance than MDPs. We also verify through simulation with an actual quadcopter that the POMDP policy works well in a real environment.
2011

Jaedeug Choi and Kee-Eung Kim: MAP Inference for Bayesian Inverse Reinforcement Learning. Advances in Neural Information Processing Systems (NIPS). 2011. [📄 Abstract] [✏️ Paper]

The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be cast within our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms.
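
A toy version of the gradient method: ascend a (sub)differentiable log posterior combining a prior term and a likelihood term. In the paper the likelihood gradient requires solving the MDP at the current reward; the Laplace-plus-Gaussian posterior below is purely illustrative.

```python
import numpy as np

# Sketch of MAP estimation by (sub)gradient ascent on the log posterior.
# Toy posterior: Laplace-style likelihood peaked at 0.5, Gaussian prior at 0.

def grad_log_posterior(r, sigma2=10.0):
    grad_prior = -r / sigma2                 # Gaussian prior, mean zero
    grad_likelihood = -np.sign(r - 0.5)      # subgradient of -|r - 0.5|
    return grad_prior + grad_likelihood

r = np.zeros(4)                              # reward estimate per state
for _ in range(200):
    r += 0.01 * grad_log_posterior(r)        # ascend the posterior
print(r)                                     # converges near the MAP value 0.5
```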

Jaeyoung Park, Kee-Eung Kim, and Yoon-Kyu Song: A POMDP-based Optimal Control of P300-based Brain-Computer Interfaces. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) NECTAR Track. 2011. [📄 Abstract] [✏️ Paper]

Most of the previous work on brain-computer interfaces (BCIs) exploiting the P300 in electroencephalography (EEG) has focused on low-level signal processing algorithms such as feature extraction and classification methods. Although a significant improvement has been made in the past, the accuracy of detecting P300 is limited by the inherently low signal-to-noise ratio in EEGs. In this paper, we present a systematic approach to optimize the interface using partially observable Markov decision processes (POMDPs). Through experiments involving human subjects, we show that the P300 speller system optimized using the POMDP achieves a significant performance improvement in terms of the communication bandwidth in the interaction.

Dongho Kim, Jaesong Lee, Kee-Eung Kim, and Pascal Poupart: Point-Based Value Iteration for Constrained POMDPs. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 2011. [📄 Abstract] [✏️ Paper]

Constrained partially observable Markov decision processes (CPOMDPs) extend standard POMDPs by allowing the specification of constraints on some aspects of the policy in addition to the optimality objective for the value function. CPOMDPs have many practical advantages over standard POMDPs since they naturally model problems involving limited resources or multiple objectives. In this paper, we show that the optimal policies in CPOMDPs can be randomized, and present exact and approximate dynamic programming methods for computing randomized optimal policies. While the exact method requires solving a minimax quadratically constrained program (QCP) in each dynamic programming update, the approximate method utilizes the point-based value update with a linear program (LP). We show that the randomized policies are significantly better than the deterministic ones. We also demonstrate that the approximate point-based method is scalable to solve large problems.
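
The approximate method's key step can be sketched as a small LP: at a belief point, mix candidate conditional plans (each with a value vector and a cost vector) to maximize value subject to the cost bound. All vectors below are invented; the paper generates them by point-based backups.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch of the LP at a belief point b: mix candidate plans (value vector
# alpha_i, cost vector c_i) to maximize value subject to the cost bound d.

b = np.array([0.6, 0.4])
alphas = np.array([[10.0, 0.0], [4.0, 4.0], [0.0, 9.0]])  # value vectors
costs = np.array([[5.0, 1.0], [2.0, 2.0], [1.0, 6.0]])    # cost vectors
d = 3.0                                                    # cost bound

values = alphas @ b
cost_at_b = costs @ b
res = linprog(c=-values,                          # maximize mixed value
              A_ub=cost_at_b[None, :], b_ub=[d],
              A_eq=np.ones((1, 3)), b_eq=[1.0], bounds=(0, None))
print(res.x)   # randomized mixture over plans; here no pure plan is optimal
```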

Dongho Kim, Jaesong Lee, Kee-Eung Kim, and Pascal Poupart: Point-Based Value Iteration for Constrained POMDPs. Proceedings of the IJCAI Workshop on Decision Making in Partially Observable, Uncertain Worlds: Exploring Insights from Multiple Communities. 2011. [📄 Abstract] [✏️ Paper]

Constrained partially observable Markov decision processes (CPOMDPs) extend the standard POMDPs by allowing the specification of constraints on some aspects of the policy in addition to the optimality objective for the value function. CPOMDPs have many practical advantages over standard POMDPs since they naturally model problems involving limited resource or multiple objectives. In this paper, we show that the optimal policies in CPOMDPs can be randomized, and present exact and approximate dynamic programming methods for computing randomized optimal policies. While the exact method requires solving a minimax quadratically constrained program (QCP) in each dynamic programming update, the approximate method utilizes the point-based value update with a linear program (LP). We show that the randomized policies are significantly better than the deterministic ones. We also demonstrate that the approximate point-based method is scalable to solve large problems.

Eunsoo Oh and Kee-Eung Kim: A Geometric Traversal Algorithm for Reward-Uncertain MDPs. Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI). 2011. [📄 Abstract] [✏️ Paper]

Markov decision processes (MDPs) are widely used in modeling decision making problems in stochastic environments. However, precise specification of the reward functions in MDPs is often very difficult. Recent approaches have focused on computing an optimal policy based on the minimax regret criterion for obtaining a robust policy under uncertainty in the reward function. One of the core tasks in computing the minimax regret policy is to obtain the set of all policies that can be optimal for some candidate reward function. In this paper, we propose an efficient algorithm that exploits the geometric properties of the reward function associated with the policies. We also present an approximate version of the method for further speed-up. We experimentally demonstrate that our algorithm improves the performance by orders of magnitude.
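
A naive baseline for the core task, finding the set of policies optimal for some reward in the uncertainty set, is to sample rewards and collect the distinct optimal policies; the paper's geometric traversal does this exhaustively and far more efficiently. The toy MDP numbers below are assumptions.

```python
import numpy as np

# Naive baseline: sample reward functions from the uncertainty set, solve
# each MDP by value iteration, and deduplicate the optimal policies.

P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])     # P[s, a, s'], toy transitions
gamma, rng = 0.9, np.random.default_rng(0)

def optimal_policy(R):                        # value iteration, R[s, a]
    V = np.zeros(2)
    for _ in range(500):
        Q = R + gamma * P @ V
        V = Q.max(axis=1)
    return tuple(Q.argmax(axis=1))

policies = {optimal_policy(rng.uniform(-1, 1, size=(2, 2))) for _ in range(200)}
print(policies)                               # set of potentially optimal policies
```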

Dongho Kim, Jaesong Lee, Kee-Eung Kim, and Pascal Poupart: 제약을 갖는 POMDP를 위한 점-기반 가치 반복 알고리즘 (Point-Based Value Iteration for Constrained POMDPs). 한국컴퓨터종합학술대회 논문집 (Proceedings of Korea Computer Congress), vol. 38(1A). 2011. Best Paper Award [📄 Abstract] [✏️ Paper]

The constrained partially observable Markov decision process (CPOMDP) extends the standard POMDP so that the policy optimizes the value function while satisfying constraints. Because CPOMDPs can naturally model problems with limited resources or multiple objectives, they are more practical than standard POMDPs. In this paper, we propose exact and approximate dynamic programming algorithms for computing randomized optimal and near-optimal policies for CPOMDPs. Whereas the exact algorithm must solve a minimax quadratically constrained program at each step of dynamic programming, the approximate algorithm uses point-based value updates that require only linear programs. Experimental results show that randomized policies perform better than deterministic ones, and that the approximate algorithm reduces computation time.

Pascal Poupart, Kee-Eung Kim, and Dongho Kim: Closing the Gap: Towards Provably Optimal POMDP Solutions. Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS). 2011. [📄 Abstract] [✏️ Paper]

POMDP algorithms have made significant progress in recent years by allowing practitioners to find good solutions to increasingly large problems. Most approaches (including point-based and policy iteration techniques) operate by refining a lower bound of the optimal value function. Several approaches (e.g., HSVI2, SARSOP, grid-based approaches and online forward search) also refine an upper bound. However, approximating the optimal value function by an upper bound is computationally expensive and therefore tightness is often sacrificed to improve efficiency (e.g., sawtooth approximation). In this paper, we describe a new approach to efficiently compute tighter bounds by i) conducting a prioritized breadth first search over the reachable beliefs, ii) propagating upper bound improvements with an augmented POMDP and iii) using exact linear programming (instead of the sawtooth approximation) for upper bound interpolation. As a result, we can represent the bounds more compactly and significantly reduce the gap between upper and lower bounds on several benchmark problems.
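
The LP interpolation discussed above can be sketched directly: the upper bound at a query belief is the tightest convex combination of stored (belief, value) points. The stored points below are invented; HSVI-style solvers maintain them during search.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch of exact LP interpolation of a POMDP upper bound: the bound at a
# query belief b is the minimum-value convex combination of stored points.

beliefs = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # stored beliefs
values = np.array([10.0, 6.0, 6.5])                        # upper bounds there
b = np.array([0.7, 0.3])                                   # query belief

res = linprog(c=values,
              A_eq=np.vstack([beliefs.T, np.ones(3)]),     # match b, sum to 1
              b_eq=np.append(b, 1.0), bounds=(0, None))
print(res.fun)   # exact interpolated upper bound at b (7.9 for these numbers)
```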

Dongho Kim, Jin Hyung Kim, and Kee-Eung Kim: Robust Performance Evaluation of POMDP-Based Dialogue Systems. IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 19(4). 2011. [📄 Abstract] [✏️ Paper] [🔗 Link]

Partially observable Markov decision processes (POMDPs) have received significant interest in research on spoken dialogue systems due to, among many benefits, their ability to naturally model the dialogue strategy selection problem under unreliable automated speech recognition. However, the POMDP approaches are essentially model-based, and as a result, the dialogue strategy computed from a POMDP is still subject to the correctness of the model. In this paper, we extend some of the previous MDP user models to POMDPs, and evaluate the effects of user models on the dialogue strategy computed from POMDPs. We experimentally show that the strategies computed from POMDPs perform better than those from MDPs, and that the strategies computed from poor user models fail severely when tested on different user models. This paper further investigates evaluation methods for dialogue strategies, and proposes a method based on bias-variance analysis for reliably estimating the dialogue performance.

Jaedeug Choi and Kee-Eung Kim: Inverse Reinforcement Learning in Partially Observable Environments. Journal of Machine Learning Research (JMLR), 12. 2011. [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to cover more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert's behavior: the case in which the expert's policy is explicitly given, and the case in which the expert's trajectories are available instead. IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness of solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings.

Dongho Kim and Kee-Eung Kim: 부분관찰 마코프 의사결정과정을 이용한 지능형 에이전트 구현 (Implementing Intelligent Agents Using Partially Observable Markov Decision Processes). 한국정보과학회지 (Communications of KIISE), 29(2). 2011. [📄 Abstract] [✏️ Paper]

This article introduces MDP and POMDP methodologies and surveys their applications, discussing in particular POMDP case studies in dialogue-based systems and brain-computer interfaces. It also reviews the current state of the art and important research topics related to POMDPs.
2010

Wonjun Lee, Sunjun Kim, Younkyung Lim, Alice Oh, Tekjin Nam, and Kee-Eung Kim: A Rapid Prototyping Method for Discovering User-Driven Opportunities for Personal Informatics. Proceedings of the International Conference on Virtual Systems and Multimedia (VSMM). 2010. Best Paper Award [📄 Abstract] [✏️ Paper]

We present our ideas for a ubiquitous computing application for family life and happiness driven by human-centered discovery. We are particularly interested in the potential of personal informatics in discovering how “knowing thyself” can help us understand what people truly value in their lives. In this position paper, we discuss a new prototyping approach in which we apply the concept of personal informatics to enable designers and developers to discover potentially viable opportunities for personal lifecare systems for family members to promote their happiness and family values by using the tools of ubiquitous computing.

Younkyung Lim, Alice Oh, Tekjin Nam, and Kee-Eung Kim: Personal Informatics for Discovering Human-Centered Lifecare System Opportunities. Proceedings of the ACM CHI Workshop on Know Thyself. 2010. [📄 Abstract] [✏️ Paper]

We present our ideas for a ubiquitous computing application for family life and happiness driven by human-centered discovery. We are particularly interested in the potential of personal informatics in discovering how “knowing thyself” can help us understand what people truly value in their lives. In this position paper, we discuss a new prototyping approach in which we apply the concept of personal informatics to enable designers and developers to discover potentially viable opportunities for personal lifecare systems for family members to promote their happiness and family values by using the tools of ubiquitous computing.

Jaeyoung Park, Kee-Eung Kim, and Sungho Jo: A POMDP Approach to P300-Based Brain-Computer Interfaces. Proceedings of the ICAPS POMDP Practitioners Workshop. 2010. [📄 Abstract] [✏️ Paper]

Most of the previous work on non-invasive brain-computer interfaces (BCIs) has been focused on feature extraction and classification algorithms to achieve high performance for the communication between the brain and the computer. While significant progress has been made in the lower layer of the BCI system, the issues in the higher layer have not been sufficiently addressed. Existing P300-based BCI systems, for example the P300 speller, use a random stimulus order for eliciting the P300 signal to identify users' intentions. This paper is about computing an optimal sequence of stimuli in order to minimize the number of stimuli, hence improving the performance. To accomplish this, we model the problem as a partially observable Markov decision process (POMDP), which is a model for planning in partially observable stochastic environments. Through simulation and human subject experiments, we show that our approach achieves a significant performance improvement in terms of the success rate and the bit rate.

Youngwook Kim and Kee-Eung Kim: Point-Based Bounded Policy Iteration for Decentralized POMDPs. Proceedings of Pacific-Rim Conference on Artificial Intelligence (PRICAI) / Lecture Notes in Computer Science (LNCS) 6230. 2010. Best Poster Award [📄 Abstract] [🔗 Link]

We present a memory-bounded approximate algorithm for solving infinite-horizon decentralized partially observable Markov decision processes (DEC-POMDPs). In particular, we improve upon the bounded policy iteration (BPI) approach, which searches for a locally optimal stochastic finite state controller, by accompanying reachability analysis on controller nodes. As a result, the algorithm has different optimization criteria for the reachable and the unreachable nodes, and it is more effective in the search for an optimal policy. Through experiments on benchmark problems, we show that our algorithm is competitive to the recent nonlinear optimization approach, both in the solution time and the policy quality.

Jaeyoung Park, Kee-Eung Kim, and Sungho Jo: A POMDP Approach to P300-Based Brain-Computer Interfaces. Proceedings of the ACM International Conference on Intelligent User Interfaces (IUI). 2010. [📄 Abstract] [✏️ Paper]

Most of the previous work on non-invasive brain-computer interfaces (BCIs) has been focused on feature extraction and classification algorithms to achieve high performance for the communication between the brain and the computer. While significant progress has been made in the lower layer of the BCI system, the issues in the higher layer have not been sufficiently addressed. Existing P300-based BCI systems, for example the P300 speller, use a random stimulus order for eliciting the P300 signal to identify users' intentions. This paper is about computing an optimal sequence of stimuli in order to minimize the number of stimuli, hence improving the performance. To accomplish this, we model the problem as a partially observable Markov decision process (POMDP), which is a model for planning in partially observable stochastic environments. Through simulation and human subject experiments, we show that our approach achieves a significant performance improvement in terms of the success rate and the bit rate.
2009

Jaedeug Choi and Kee-Eung Kim: Inverse Reinforcement Learning in Partially Observable Environments. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 2009. [📄 Abstract] [✏️ Paper] [🧑‍💻 Code]

Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behaviour of an expert. Most of the existing algorithms for IRL assume that the expert's environment is modeled as a Markov decision process (MDP), although they should be able to handle partially observable settings in order to widen the applicability to more realistic scenarios. In this paper, we present an extension of the classical IRL algorithm by Ng and Russell to partially observable environments. We discuss technical issues and challenges, and present the experimental results on some of the benchmark partially observable domains.
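
For reference, the classical Ng-Russell feasibility condition that the paper extends can be checked in a few lines: a reward vector R is consistent with the expert action a1 being optimal in every state iff (P_a1 - P_a)(I - gamma * P_a1)^{-1} R >= 0 for every other action a. The toy transition matrices below are assumptions.

```python
import numpy as np

# Sketch of the Ng-Russell IRL feasibility test in a fully observable MDP,
# the starting point the paper extends to POMDPs. Transitions are toy values.

gamma = 0.9
P = {"a1": np.array([[0.9, 0.1], [0.2, 0.8]]),
     "a2": np.array([[0.5, 0.5], [0.6, 0.4]])}

def consistent(R, expert="a1"):
    occ = np.linalg.inv(np.eye(2) - gamma * P[expert])
    return all(np.all((P[expert] - P[a]) @ occ @ R >= -1e-9)
               for a in P if a != expert)

# Constant rewards are trivially consistent (the classic degeneracy that
# motivates regularization); the second reward vector is not consistent.
print(consistent(np.array([1.0, 1.0])), consistent(np.array([1.0, 0.0])))
```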
2008

Dongho Kim, Hyeong Seop Sim, Kee-Eung Kim, Jin Hyung Kim, Hyunjeong Kim, and Joo Won Sung: Effects of User Modeling on POMDP-based Dialogue Systems. Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH). 2008. Best Student Paper Runner-up [📄 Abstract] [✏️ Paper]

Partially observable Markov decision processes (POMDPs) have gained significant interest in research on spoken dialogue systems due to, among many benefits, their ability to naturally model the dialogue strategy selection problem under unreliable automated speech recognition. However, the POMDP approaches are essentially model-based, and as a result, the dialogue strategy computed from a POMDP is subject to the correctness of the model. In this paper, we extend some of the previous user models for POMDPs, and evaluate the effects of user models on the dialogue strategy computed from the POMDP.

Jae-Hyun Seok, Simon Levasseur, Kee-Eung Kim, and Jin Hyung Kim: Tracing Handwriting on Paper Document under Video Camera. Proceedings of the International Conference on Frontiers in Handwriting Recognition (ICFHR). 2008. [📄 Abstract] [✏️ Paper]

This paper describes a system that traces handwriting on a paper document under an overlooking video camera. The work is motivated by the goal of capturing annotations written on paper documents with an ordinary pen as input to a computer. As the trajectory of the pen tip is extracted from the video, each part of the trajectory is classified as 'pen-down' or 'pen-up' according to whether that part leaves a dark line. Detecting written ink is not simple when handwriting is made over printed documents: written ink may fall on dark regions of the document and often overlaps previously written ink, so simple background checking does not work there. We therefore interpolate the decisions made at the entry and exit of each dark region. The system makes two-level decisions to achieve both speed and accuracy: the classifier makes quick decisions based on local information so as not to lose the pen trace, and the local pen up-down decisions are corrected from a global point of view once the whole writing process is available, such as when the hand is out of view. Experimental results show that the system detects handwriting accurately even on printed documents.

Hyeong Seop Sim, Kee-Eung Kim, Jin Hyung Kim, Du-Seong Chang, and Myoung-Wan Koo: Symbolic Heuristic Search Value Iteration for Factored POMDPs. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). 2008. [📄 Abstract] [✏️ Paper]

We propose the symbolic heuristic search value iteration (Symbolic HSVI) algorithm, which extends the heuristic search value iteration (HSVI) algorithm to handle factored partially observable Markov decision processes (factored POMDPs). The idea is to use algebraic decision diagrams (ADDs) to compactly represent the problem itself and all the relevant intermediate computation results in the algorithm. We leverage Symbolic Perseus for computing the lower bound of the optimal value function using ADD operators, and provide a novel ADD-based procedure for computing the upper bound. Experiments on a number of standard factored POMDP problems show that we can achieve an order of magnitude improvement in performance over previously proposed algorithms.
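
To give a flavor of the ADD machinery, here is a didactic miniature: functions over boolean variables stored as (var, low, high) nodes, with a memoized apply that combines two diagrams pointwise. This is a stand-in for illustration, not the ADD package the authors build on.

```python
# Didactic miniature of algebraic decision diagrams (ADDs): real-valued
# functions of boolean variables as (var, low, high) tuples with merged
# identical branches, plus a memoized pointwise "apply" operator.

def mk(var, low, high):
    return low if low == high else (var, low, high)

def apply_op(op, f, g, cache=None):
    cache = {} if cache is None else cache
    key = (id(f), id(g))
    if key in cache:
        return cache[key]
    if not isinstance(f, tuple) and not isinstance(g, tuple):
        out = op(f, g)                       # both are leaves (numbers)
    else:
        var = min(x[0] for x in (f, g) if isinstance(x, tuple))
        fl, fh = (f[1], f[2]) if isinstance(f, tuple) and f[0] == var else (f, f)
        gl, gh = (g[1], g[2]) if isinstance(g, tuple) and g[0] == var else (g, g)
        out = mk(var, apply_op(op, fl, gl, cache), apply_op(op, fh, gh, cache))
    cache[key] = out
    return out

# f(x0) = 1 if x0 is false else 3; adding the constant 3 symbolically:
f = mk(0, 1.0, 3.0)
print(apply_op(lambda a, b: a + b, f, 3.0))   # -> (0, 4.0, 6.0)
```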

Kee-Eung Kim: Exploiting Symmetries in POMDPs for Point-Based Algorithms. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). 2008. [📄 Abstract] [✏️ Paper]

We extend the model minimization technique for partially observable Markov decision processes (POMDPs) to handle symmetries in the joint space of states, actions, and observations. The POMDP symmetry we define in this paper cannot be handled by the model minimization techniques previously published in the literature. We formulate the problem of finding the symmetries as a graph automorphism (GA) problem, and although it is not yet known to be tractable, we experimentally show that the sparseness of the graph representing the POMDP allows us to quickly find symmetries. We show how the symmetries in POMDPs can be exploited for speeding up point-based algorithms. We experimentally demonstrate the effectiveness of our approach.
2007

Jihoon Kim, Taik Heon Rhee, Kee-Eung Kim, and Jin Hyung Kim: Place Recognition Using Multiple Wearable Cameras. Proceedings of the 4th International Symposium on Ubiquitous Computing Systems (UCS) / Lecture Notes in Computer Science (LNCS) 4836. 2007. [📄 Abstract] [✏️ Paper]

Recognizing a user's location is the most challenging problem in providing intelligent location-based services. In this paper, we present a real-time camera-based system for the place recognition problem. The system takes streams of scene images of a learned environment from user-worn cameras and produces the class label of the current place as output. Multiple cameras are used to collect multi-directional scene images, because utilizing multiple images yields better and more robust recognition than a single image. For more robust recognition, we utilize spatial relationships between the places. In addition, temporal reasoning is incorporated via a Markov model to reflect the typical staying time at each place. Recognition experiments, conducted in a real environment on a university campus, show that the proposed method yields very promising results.
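
The temporal-reasoning layer is essentially HMM filtering: per-frame place scores from the cameras are fused with a Markov transition model over places. A minimal sketch with invented transition and observation numbers:

```python
import numpy as np

# Sketch of temporal reasoning for place recognition: HMM filtering that
# fuses per-frame place scores with a Markov transition model over places.

T = np.array([[0.9, 0.05, 0.05],     # place-to-place transition probabilities
              [0.05, 0.9, 0.05],     # (diagonal dominance models staying put)
              [0.05, 0.05, 0.9]])

def filter_step(belief, obs_scores):
    predicted = T.T @ belief          # where could the user have moved?
    posterior = predicted * obs_scores
    return posterior / posterior.sum()

belief = np.full(3, 1 / 3)
for scores in ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1]):
    belief = filter_step(belief, np.array(scores))
print(belief.argmax())                # most likely current place
```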

Jihoon Kim, Taik Heon Rhee, Kee-Eung Kim, and Jin Hyung Kim: Signboard Recognition by Consistency Checking of Local Features. 2nd Korea-Japan Joint Workshop on Pattern Recognition (KJPR). 2007. [📄 Abstract] [✏️ Paper]

The problem of recognizing signboards in street scenes is defined as matching the input image to pre-stored 2D signboard images. This problem is not as simple as it appears due to arbitrary drawings and relative 3D positions. We approach this problem by matching characteristic local features of the input image to those of images in the database. Local decisions are verified from the global viewpoint of homographic consistency and color consistency. The well-known SIFT feature is used as the local feature, and the homographic consistency check is performed using RANSAC, a random sampling method. In order to handle highly perspective-distorted signboards, several perspective-transformed templates are generated offline. In our experiment, with a database of 35 images, the proposed method achieved a 95% recognition rate, showing good results despite highly distorted input images.
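
The pipeline maps closely onto off-the-shelf OpenCV pieces: SIFT features, nearest-neighbour matching with Lowe's ratio test, and homographic consistency via RANSAC. The sketch below, which matches a synthetic patch against a shifted copy of itself to stay self-contained, approximates the approach rather than reproducing the authors' implementation.

```python
import cv2
import numpy as np

# Sketch of SIFT matching plus RANSAC homography verification. A synthetic
# "signboard" is embedded in a larger scene so the script runs without files.

rng = np.random.default_rng(0)
template = (rng.random((200, 200)) * 255).astype(np.uint8)
scene = np.zeros((300, 300), np.uint8)
scene[50:250, 60:260] = template                 # template embedded in scene

sift = cv2.SIFT_create()
kt, dt = sift.detectAndCompute(template, None)
ks, ds = sift.detectAndCompute(scene, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(dt, ds, k=2)
        if m.distance < 0.75 * n.distance]       # Lowe's ratio test

src = np.float32([kt[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([ks[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print(H.round(1))        # should be close to a pure (60, 50) translation
print(int(mask.sum()), "of", len(good), "matches are homography-consistent")
```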
2006

Kee-Eung Kim, Wook Chang, Sung-Jung Cho, Junghyun Shim, Hyunjeong Lee, Joonah Park, Youngbeom Lee, and Sangryoung Kim: Hand Grip Pattern Recognition for Mobile User Interfaces. Proceedings of the Innovative Applications of Artificial Intelligence Conference (IAAI). 2006. [📄 Abstract] [✏️ Paper]

This paper presents a novel user interface for handheld mobile devices by recognizing hand grip patterns. Particularly, we consider the scenario where the device is provided with an array of capacitive touch sensors underneath the exterior cover. In order to provide the users with intuitive and natural manipulation experience, we use pattern recognition techniques for identifying the users' hand grips from the touch sensors. Preliminary user studies suggest that filtering out unintended user hand grip is one of the most important issues to be resolved. We discuss the details of the prototype implementation, as well as engineering challenges for practical deployment.

Wook Chang, Kee-Eung Kim, Hyunjeong Lee, Joon Kee Cho, Byung Seok Soh, Jung Hyun Shim, Gyunghye Yang, Sung-Jung Cho, and Joonah Park: Recognition of Grip-Patterns by using Capacitive Touch Sensors. Proceedings of the IEEE International Symposium on Industrial Electronics (ISIE). 2006. [📄 Abstract] [✏️ Paper]

A novel and intuitive way of accessing applications of mobile devices is presented. The key idea is to use grip-pattern, which is naturally produced when a user tries to use the mobile device, as a clue to determine an application to be launched. To this end, a capacitive touch sensor system is carefully designed and installed underneath the housing of the mobile device to capture the information of the user's grip-pattern. The captured data is then recognized by a minimum distance classifier and a naive Bayes classifier. The recognition test is performed to validate the feasibility of the proposed user interface system.
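
Both classifiers named above are simple enough to sketch on fake sensor data: a minimum-distance (nearest class mean) rule and a Gaussian naive Bayes rule over the capacitive readings. The grip classes and data below are invented.

```python
import numpy as np

# Sketch of the two grip-pattern classifiers on fake 16-sensor capacitive
# readings: minimum-distance (nearest class mean) and Gaussian naive Bayes.

rng = np.random.default_rng(0)
X = {"call": rng.normal(0.8, 0.1, (20, 16)),     # invented grip samples
     "camera": rng.normal(0.3, 0.1, (20, 16))}

means = {g: x.mean(axis=0) for g, x in X.items()}
stds = {g: x.std(axis=0) + 1e-6 for g, x in X.items()}

def min_distance(sample):
    return min(means, key=lambda g: np.linalg.norm(sample - means[g]))

def naive_bayes(sample):
    def log_like(g):      # per-sensor Gaussian log likelihood, summed
        z = (sample - means[g]) / stds[g]
        return -0.5 * np.sum(z ** 2) - np.sum(np.log(stds[g]))
    return max(X, key=log_like)

probe = rng.normal(0.75, 0.1, 16)                # looks like a "call" grip
print(min_distance(probe), naive_bayes(probe))
```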

Kee-Eung Kim, Taeseo Park, Min-Kyu Park, Youngbeom Lee, Yunbae Kim, and Sangryoung Kim: Adaptive Event Clustering for Personalized Photo Browsing. 한국 HCI 학술대회 논문집 (Proceedings of Korean HCI Conference). 2006. [📄 Abstract] [✏️ Paper]

Since the introduction of the digital camera to the mass market, the number of digital photos owned by an individual has been growing at an alarming rate. This phenomenon naturally leads to difficulties in searching and browsing a personal digital photo archive. The traditional approach typically involves content-based image retrieval using computer vision algorithms. However, due to the performance limitations of these algorithms, at least on casual digital photos taken by non-professional photographers, more recent approaches center on time-based clustering algorithms that analyze the shot times of photos. These time-based clustering algorithms are based on the insight that when photos are clustered according to shot-time similarity, we obtain “event clusters” that help the user browse through her photo archive. It has also been reported that one of the remaining problems with the time-based approach is that people perceive events at different scales. In this paper, we present an adaptive time-based clustering algorithm that exploits the usage history of digital photos to infer the user's preference on event granularity. Experiments show significant improvements in clustering accuracy.
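
The underlying time-based clustering is easy to sketch: start a new event whenever the gap between consecutive shot times exceeds a threshold. The paper's contribution is adapting that threshold per user from browsing history; the fixed two-hour gap below is an assumption.

```python
from datetime import datetime, timedelta

# Sketch of time-based event clustering: a new event starts whenever the gap
# between consecutive shot times exceeds a threshold (fixed here; adaptive
# per user in the paper).

def cluster_by_time(shot_times, gap=timedelta(hours=2)):
    events, current = [], [shot_times[0]]
    for prev, cur in zip(shot_times, shot_times[1:]):
        if cur - prev > gap:
            events.append(current)
            current = []
        current.append(cur)
    events.append(current)
    return events

shots = [datetime(2006, 1, 1, 9, 0), datetime(2006, 1, 1, 9, 40),
         datetime(2006, 1, 1, 10, 10), datetime(2006, 1, 1, 15, 0),
         datetime(2006, 1, 1, 15, 30)]
print([len(e) for e in cluster_by_time(shots)])   # -> [3, 2]
```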

Wook Chang, Kee-Eung Kim, Hyunjeong Lee, Joonki Cho, Byeongsuk Soh, Junghyun Shim, Kyunghye Yang, Sung-Jung Cho, and Junah Park: Designing Mobile User Interfaces Using Hand Grip Recognition. 한국 HCI 학술대회 논문집 (Proceedings of Korean HCI Conference). 2006.

2005

SeongHwan Cho and Kee-Eung Kim: Variable Bandwidth Allocation Scheme for Energy Efficient Wireless Sensor Network. Proceedings of the IEEE International Conference on Communications (ICC). 2005. [📄 Abstract] [✏️ Paper]

Increasing the lifetime of wireless sensors is essential for the proliferation of wireless sensor networks in various environments. In this paper, the relationship between bandwidth and energy consumption is exploited to increase the lifetime of the sensors. A variable bandwidth allocation scheme that uses time-frequency slot assignment is proposed to reduce the energy consumption of a collaborative sensor network which has large spatial variation in node density and event rates. To assign the time-frequency slots to the sensor network, a novel algorithm is presented, which results in significant energy savings over the conventional constant bandwidth allocation scheme.

Wook Chang, Juna Park, Kee-Eung Kim, Sung-Jung Cho, Hyun-Jung Lee, and Junghyun Shim: 접촉 센서를 이용한 사용자 인터페이스 설계 (Designing a Touch-based User Interface System for Handheld Devices). 한국 HCI 학술대회 논문집 (Proceedings of Korean HCI Conference). 2005. [📄 Abstract] [✏️ Paper]

This paper proposes a new interaction system for portable devices that combines two different types of sensors: a set of capacitive touch sensors and an accelerometer. The touch sensing system of this device can detect multiple finger-touches and finger proximity to the surface, while traditional touch sensing systems such as touchpad usually focus on recognizing the position of a single finger. In addition, a tri-axis accelerometer is applied to measure the motion information such as the inclination angle and vibration of the system caused by a user. Combining multi-finger touch and motion information, the proposed system provides users with a game-like experience by enhancing contextual navigation and realistic manipulation.
2003

Kee-Eung Kim and Thomas Dean: Solving Factored MDPs Using Non-Homogeneous Partitions. Artificial Intelligence, 147(1-2). 2003.

2002

Kee-Eung Kim and Thomas Dean: Solving Factored MDPs with Large Action Space Using Algebraic Decision Diagrams. Proceedings of Pacific-Rim Conference on Artificial Intelligence (PRICAI) / Lecture Notes in Computer Science (LNCS) 2417. 2002.

2001

Kee-Eung Kim and Thomas Dean: Solving Factored MDPs via Non-homogeneous Partitioning. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI). 2001.

Nicolas Meuleau, Leonid Peshkin, and Kee-Eung Kim: Exploration in Gradient-based Reinforcement Learning. MIT AI Memo 2001-003. 2001.

2000

Kee-Eung Kim, Thomas Dean, and Nicolas Meuleau: Approximate Solutions to Factored Markov Decision Processes via Greedy Search in the Space of Finite State Controllers. Proceedings of the Fifth International Conference on Artificial Intelligence in Planning and Scheduling (AIPS). 2000.

Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau, and Leslie Pack Kaelbling: Learning to Cooperate via Policy Search. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI). 2000.

Kee-Eung Kim, Thomas Dean, and Samuel Hazlehurst: Linear Algebra in Very High-Dimension Vector Spaces With an Application to Solving Markov Decision Processes. Neural Computing Surveys, 3. 2000.

Kee-Eung Kim, Thomas Dean, and Samuel Hazlehurst: Linear Algebra in Very High-Dimension Vector Spaces: Algorithms and Data Structures for Implementing Exact and Approximate Solution Methods. Department of Computer Science, Brown University, Technical Report CS-00-02. 2000.

1999

Thomas Dean, Kee-Eung Kim, and Samuel Hazlehurst: Linear Algebra in Very High-Dimension Vector Spaces With an Application to Solving Markov Decision Processes. Proceedings of IJCAI-99 Workshop on Statistical Machine Learning for Large-Scale Optimization. 1999.

Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, and Leslie Pack Kaelbling: Learning Finite-State Controllers for Partially Observable Environments. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI). 1999.

Nicolas Meuleau, Kee-Eung Kim, Leslie Pack Kaelbling, and Anthony R. Cassandra: Solving POMDPs by Searching the Space of Finite Policies. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI). 1999.

1998

Nicolas Meuleau, Milos Hauskrecht, Kee-Eung Kim, Leonid Peshkin, Leslie Pack Kaelbling, Thomas Dean, and Craig Boutilier: Solving Very Large Weakly Coupled Markov Decision Processes. Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI). 1998.

Thomas Dean, Kee-Eung Kim, and Robert Givan: Solving Planning Problems with Large State and Action Spaces. Proceedings of the Fourth International Conference on Artificial Intelligence Planning Systems (AIPS). 1998.