Research Statement

My research asks a central question: how can RL agents interpret and generalize human intent across tasks, modalities, and environments—without manually engineered rewards? Standard RL produces capable optimizers, but two fundamental gaps limit their deployment alongside people. First, they lack any channel for human communication: goals must be hand-coded at training time, with no mechanism for receiving or acting on natural-language instructions at inference. Second, even carefully specified objectives can produce behavior distributions misaligned with what humans actually want: proxy rewards are optimized faithfully, yet the resulting outputs routinely diverge from human expectations. I address both gaps through instruction-conditioned policy learning, multimodal behavioral grounding, and compositional cross-environment generalization.

Procedural content generation (PCG) for games serves as my primary experimental substrate, not as an end goal, but because it offers uniquely controllable and measurable conditions for studying these problems. Game environments provide structured procedural variation, precise evaluation metrics, and a directly observable gap between a specified intent and a generated output. This makes them well suited to developing and stress-testing methods for controllability, semantic grounding, and compositional adaptation, capabilities that matter well beyond games.

"Can a single policy, conditioned on natural language, adapt its behavior continuously at inference time without reward re-engineering, and transfer its grounded representation to environments and task combinations it never encountered during training?"

Research Thrusts

Three connected thrusts, each a partial answer to the central question above, progressively extend the scope of controllability and generalization: (i) inference-time behavioral control via semantically structured instruction representations; (ii) multimodal grounding to resolve the inherent underspecification of language-only conditioning; and (iii) compositional cross-environment transfer through a unified latent space that supports recombination and adaptation across structurally distinct domains.

Thrust 1 · Inference-Time Behavioral Control

Semantically Structured Instruction Representations for Reward-Free Policy Adaptation

Can natural language serve as a continuous behavioral specification—enabling a single policy to adapt its behavior at inference time without any reward re-engineering?

Standard RL policies are coupled to a fixed training objective: adapting to a new goal requires modifying the reward function and retraining from scratch, with no mechanism for receiving human instruction at deployment time. We address this by conditioning the policy on a semantically structured instruction embedding, encoded from a natural-language specification and injected directly into the policy network, so that varying the instruction alone steers behavior continuously at inference time, without any gradient update [1].
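As a rough illustration of the conditioning idea, the sketch below uses a toy linear policy whose action logits depend on both the observation and an instruction embedding. The class name, the dimensions, the random weights, and the two example embeddings are all hypothetical; the actual architecture in [1] is not reproduced here. The point is only that the weights stay fixed while the instruction vector changes the action distribution.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

class InstructionConditionedPolicy:
    """Toy policy (assumption): logits = W_obs @ obs + W_ins @ instruction."""

    def __init__(self, obs_dim, ins_dim, n_actions, seed=0):
        rng = random.Random(seed)
        # Random weights stand in for a trained network.
        self.w_obs = [[rng.gauss(0, 1) for _ in range(obs_dim)]
                      for _ in range(n_actions)]
        self.w_ins = [[rng.gauss(0, 1) for _ in range(ins_dim)]
                      for _ in range(n_actions)]

    def action_probs(self, obs, instruction_embedding):
        obs_logits = matvec(self.w_obs, obs)
        ins_logits = matvec(self.w_ins, instruction_embedding)
        return softmax([a + b for a, b in zip(obs_logits, ins_logits)])

policy = InstructionConditionedPolicy(obs_dim=4, ins_dim=3, n_actions=5)
obs = [0.2, -0.1, 0.5, 0.0]

# Same weights, same observation; only the (hypothetical) instruction differs.
probs_a = policy.action_probs(obs, [1.0, 0.0, 0.0])
probs_b = policy.action_probs(obs, [0.0, 1.0, 0.0])
```

Comparing `probs_a` and `probs_b` shows the behavioral change coming entirely from the instruction input, with no parameter update in between.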

The approach extends naturally to the multi-objective setting. Rather than collapsing competing objectives into a scalar, we learn a disentangled representation that encourages each objective dimension to occupy a separable subspace. The policy can then traverse trade-offs among conflicting constraints by interpolating within the instruction space, moving continuously through the behavioral specification without retraining for individual configurations [2].
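To make the interpolation idea concrete, here is a minimal sketch that assumes each objective owns a contiguous block of coordinates in the instruction embedding. That layout, the objective names `path_length` and `enemy_count`, and the example vectors are hypothetical and chosen only for illustration; a learned representation would be only approximately separable.

```python
def interpolate_objective(z_from, z_to, subspace, alpha):
    """Move only the coordinates in `subspace` from z_from toward z_to.

    `subspace` is a (start, stop) index pair; alpha in [0, 1] sets how far
    to move. Coordinates outside the subspace are copied unchanged.
    """
    start, stop = subspace
    blended = list(z_from)
    for i in range(start, stop):
        blended[i] = (1 - alpha) * z_from[i] + alpha * z_to[i]
    return blended

# Hypothetical layout: each objective owns one block of the embedding.
SUBSPACES = {"path_length": (0, 2), "enemy_count": (2, 4)}

z_short_few = [0.0, 1.0, 0.0, 1.0]   # hypothetical "short path, few enemies"
z_long_many = [1.0, 0.0, 1.0, 0.0]   # hypothetical "long path, many enemies"

# Sweep the enemy-count trade-off while holding the path-length block fixed.
sweep = [
    interpolate_objective(z_short_few, z_long_many, SUBSPACES["enemy_count"], a)
    for a in (0.0, 0.5, 1.0)
]
```

Each vector in `sweep` could then be fed to the same frozen policy, which is what allows a trade-off to be traversed without training one policy per configuration.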

Thrust 2 · Multimodal Grounding

Resolving Behavioral Underspecification via Tri-Modal Contrastive Grounding

Language is inherently underspecified. Grounding a policy in complementary perceptual modalities sharpens intent resolution and reduces output variance.

A natural-language prompt is consistent with many distinct output distributions: the same instruction can reasonably describe very different behaviors, leaving the policy underspecified. We address this through a tri-modal contrastive grounding framework that jointly embeds three signal types (natural-language instructions, spatial layout observations, and output trajectories) into a unified metric space. The contrastive objective pulls together representations from different modalities that correspond to the same behavioral target, while pushing apart representations of distinct targets.
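One common way to realize such a pull-together, push-apart objective is a symmetric InfoNCE-style loss averaged over the three modality pairs. The sketch below assumes that form; the exact loss used in [3] may differ, and the function names, the temperature value, and the toy two-dimensional embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to positives[i].

    Row i of `positives` is the positive for row i of `anchors`; every
    other row in the batch acts as a negative.
    """
    loss = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_norm - logits[i]
    return loss / len(anchors)

def tri_modal_loss(text, layout, trajectory, temperature=0.1):
    """Average symmetric InfoNCE over the three modality pairs (assumption)."""
    pairs = [(text, layout), (text, trajectory), (layout, trajectory)]
    total = 0.0
    for a, b in pairs:
        total += 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))
    return total / len(pairs)

# Toy batch of two behavioral targets; row i of each modality describes target i.
text = [[1.0, 0.0], [0.0, 1.0]]
layout = [[0.9, 0.1], [0.1, 0.9]]
trajectory = [[1.0, 0.2], [0.2, 1.0]]

aligned = tri_modal_loss(text, layout, trajectory)
# Reversing one modality breaks the correspondence and should raise the loss.
misaligned = tri_modal_loss(text, layout[::-1], trajectory)
```

In this toy batch the loss is lower when rows of the three modalities refer to the same target, which is the property the contrastive objective is meant to enforce.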

The policy accepts any available subset of modalities as input, allowing flexible deployment when some modalities are unavailable. Combining text with spatial observations resolves much of the ambiguity that language alone cannot eliminate, reducing output variance in our evaluations. Human preference studies confirm that the grounded policy produces behavior that matches the specified intent more faithfully than single-modality baselines [3].
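A simple way to accept any subset of modalities, assuming they already share one embedding space, is to average whichever embeddings are present and renormalize. This mean-fusion rule is an illustrative assumption rather than the mechanism reported in [3], and the function name and example vectors are hypothetical.

```python
import math

def fuse_available(embeddings):
    """Average the embeddings that are present (not None), then L2-normalize.

    `embeddings` maps a modality name to a vector or to None when that
    modality is missing at deployment time.
    """
    present = [e for e in embeddings.values() if e is not None]
    if not present:
        raise ValueError("at least one modality is required")
    dim = len(present[0])
    mean = [sum(e[i] for e in present) / len(present) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        return mean  # degenerate case: nothing to normalize
    return [x / norm for x in mean]

# Text only: the fused condition is just the text embedding.
text_only = fuse_available(
    {"text": [1.0, 0.0], "layout": None, "trajectory": None}
)

# Text plus layout: the condition moves toward the layout embedding.
text_and_layout = fuse_available(
    {"text": [1.0, 0.0], "layout": [0.0, 1.0], "trajectory": None}
)
```

Because the fused vector has the same shape regardless of which inputs are supplied, the downstream policy does not need a separate head per modality combination.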

Thrust 3 · Compositional Cross-Environment Transfer

A Unified Latent Space for Language-Guided Recombination Across Structurally Distinct Environments

Can a single encoder learn a representation that transfers, interpolates, and recombines behaviors across environments never seen together during training?

Thrusts 1–2 operate within a single environment. The harder question is whether a shared representation can capture structural regularities across multiple distinct environments and make those regularities accessible for reuse. We train a single encoder jointly across structurally distinct domains under a cross-environment alignment objective that draws together latent representations of behaviorally equivalent targets, while preserving environment-specific structure where needed. The result is a unified embedding space that supports transfer: knowledge acquired in one environment can be directly accessed and reused in another without fine-tuning.

A language instruction then parameterizes soft interpolation within this shared space, producing outputs that compositionally blend characteristics from distinct source environments—enabling zero-shot recombination of behaviors not seen in any single training domain [4]. This compositional structure is the foundation for the broader agenda: extending the same shared encoder to control policies could allow language-guided latent interpolation to directly produce behavioral blends across tasks, potentially enabling open-ended adaptation to novel task combinations at inference time.
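The soft interpolation described above can be sketched as a convex combination of per-environment latents, with mixing weights derived from the instruction. In this sketch the instruction is assumed to have already been turned into one relevance score per source environment; that scoring step, the function names, and the toy latents are hypothetical and not taken from [4].

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def blend_latents(instruction_scores, env_latents, temperature=1.0):
    """Blend per-environment latents with instruction-derived soft weights.

    instruction_scores[k] is a (hypothetical) relevance score of the
    instruction for environment k; env_latents[k] is that environment's
    latent in the shared space.
    """
    weights = softmax(instruction_scores, temperature)
    dim = len(env_latents[0])
    blended = [
        sum(w * z[i] for w, z in zip(weights, env_latents))
        for i in range(dim)
    ]
    return weights, blended

# Two source environments with toy latents in a shared 3-D space.
env_latents = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# An instruction that favors neither environment gives an even blend.
even_weights, even_blend = blend_latents([0.0, 0.0], env_latents)

# An instruction that favors the first environment skews the blend toward it.
skew_weights, skew_blend = blend_latents([2.0, 0.0], env_latents)
```

Lowering `temperature` sharpens the weights toward a single environment, while raising it moves the blend toward a uniform mixture.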

Toward Grounded and Controllable AI Agents

The three thrusts above form the foundation for my broader research agenda: building interactive agents that can interpret, adapt to, and act on human intent across diverse tasks, modalities, and environments. Moving beyond controlled game settings, this agenda naturally extends to embodied and interactive systems, where agents must ground language in visual observations, temporal dynamics, tactile feedback, and real-time action.

The shared representations developed in my work provide a path toward open-ended compositional generalization, enabling agents to recombine learned goals, skills, and environmental concepts rather than memorizing fixed instruction–behavior pairs. More broadly, mechanisms such as semantically structured embeddings, multimodal grounding, inference-time controllability, and cross-domain transfer are relevant wherever AI systems must follow human instructions in ways that are verifiable, steerable, and robust to distributional shift. My long-term goal is to develop a principled framework for scalable, instruction-driven alignment in interactive agents while preserving meaningful human control over agent behavior.

References

  1. I.-C. Baek*, S.-H. Kim*, S.-Y. Lee, D.-H. Lee, K.-J. Kim. "IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation." IEEE Conference on Games (CoG), 2025. (* Equal contribution)
  2. S.-H. Kim, G.-H. Hwang, I.-C. Baek, S.-Y. Lee, K.-J. Kim. "Multi-Objective Instruction-Aware Representation Learning in Procedural Content Generation Reinforcement Learning." Accepted at IEEE Conference on Games (CoG), 2026.
  3. I.-C. Baek*, S. Lee*, S.-H. Kim, G. Hwang, K.-J. Kim. "Human-Aligned Procedural Level Generation Reinforcement Learning via Text-Level-Sketch Shared Representation." Under revision at IEEE Transactions on Games, 2025. (* Equal contribution)
  4. I.-C. Baek*, J. Jung*, S.-H. Kim, G.-H. Hwang, K.-J. Kim. "Multiverse: Language-Conditioned Multi-Game Level Blending via Shared Representation." Accepted at IEEE Conference on Games (CoG), 2026. (* Equal contribution)
  5. I.-C. Baek, S.-H. Kim, S. Earle, Z. Jiang, J.-H. Noh, J. Togelius, K.-J. Kim. "PCGRLLM: Large Language Model-Driven Reward Design for Procedural Content Generation Reinforcement Learning." Under revision at IEEE Transactions on Games, 2025.