# Hello World for AI Agents: Building Game-Playing Agents

This is a Hello World example for building AI Agents. The point of the [TextWorld Agent Repository](https://github.com/DoKu88/textWorldAgent) is to show how to get started building AI Agents and the simple components they're made of. The setup is very similar to the reinforcement learning environments from [OpenAI Gym](https://gymnasium.farama.org/), where a standardized environment interface begets a standardized agent interface to interact with it. I did not do any training for this project, since ChatGPT worked fantastically out of the box, and I have decided to save training for a harder AI Agent task.

## The Problem: Text Adventure Games

Text adventure games present the player with a description of their environment and accept natural language commands:

```
You are in a small kitchen. There's a table with an apple on it.
A door leads north to the living room.

> take apple
You pick up the apple.

> go north
You enter the living room...
```

These games used to be surprisingly hard for AI. They require:

- Understanding natural language descriptions
- Maintaining memory of what you've seen and done
- Planning multi-step sequences toward goals
- Selecting valid actions from a constrained set

I used [TextWorld](https://github.com/microsoft/TextWorld), Microsoft's framework for procedurally generated text games, as the environment. The question was: how do you build an agent architecture that's clean, extensible, and actually works?

## The Core Insight: The Base Class Is the Framework

The heart of this project is a single abstract class: `BaseAgent`. It handles everything common to all agents:

```python
class BaseAgent(ABC):
    """Abstract base agent for TextWorld using Pydantic I/O."""

    def __init__(self, history_length: int = 0, objective_mode: str = "explicit"):
        self._history: List[tuple[str, str]] = []
        self._history_length = history_length
        self._objective_mode = objective_mode

    @abstractmethod
    def act(self, agent_input: AgentInput) -> AgentOutput:
        """Subclasses implement this to select an action."""
        pass

    def __call__(self, observation, score, done, info) -> str:
        """Convenience callable interface matching TextWorld's expected signature."""
        agent_input = AgentInput.from_textworld(
            observation, score, done, info, self._objective_mode
        )
        output = self.act(agent_input)
        self._record_history(observation, output.action)
        return output.action
```

This design means:

- **One method to implement**: New agents only need to define `act()`
- **Shared infrastructure**: History tracking, prompt building, and action parsing are all handled
- **Consistent interface**: Every agent looks the same to the runner

The base class provides sensible defaults for the hard parts: building the prompt, cleaning observations (removing ASCII art artifacts), and parsing LLM responses into valid commands. These are solved problems that all of our agents need.
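To make the runner-facing interface concrete, here is a minimal driver loop sketched against TextWorld's standard gym wrapper. The game file path and the exact set of requested infos are illustrative assumptions, not taken from the repo:

```python
import gym
import textworld.gym
from textworld import EnvInfos

# Ask TextWorld for the extra state the agent consumes.
infos = EnvInfos(admissible_commands=True, inventory=True,
                 objective=True, max_score=True)

# "games/simple_quest.ulx" is a placeholder path.
env_id = textworld.gym.register_game("games/simple_quest.ulx", request_infos=infos)
env = gym.make(env_id)

agent = RandomAgent()  # any BaseAgent subclass works here

obs, info = env.reset()
score, done = 0, False
while not done:
    command = agent(obs, score, done, info)  # BaseAgent.__call__
    obs, score, done, info = env.step(command)
print(f"Final score: {score}")
```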
## Type-Safe I/O with Pydantic

One note: using type-safe inputs and outputs for your agent makes writing and debugging code far more predictable. Without guardrails, you may end up with `KeyError` exceptions and silent failures. Pydantic models fix this:

```python
from typing import List, Optional
from pydantic import BaseModel

class AgentInput(BaseModel):
    observation: str
    score: int
    done: bool
    admissible_commands: List[str]
    inventory: Optional[str] = None
    max_score: Optional[int] = None
    objective: Optional[str] = None

    @classmethod
    def from_textworld(cls, observation, score, done, info, objective_mode):
        """Factory method that handles all the TextWorld-specific weirdness."""
        # Extract and validate fields...
        return cls(...)

class AgentOutput(BaseModel):
    action: str
    reasoning: Optional[str] = None
    confidence: Optional[float] = None
```

Every agent receives an `AgentInput` and returns an `AgentOutput`.
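The payoff is that malformed data fails loudly at construction time rather than as a `KeyError` three calls later. A small illustration, assuming the Pydantic models above:

```python
from pydantic import ValidationError

try:
    # 'admissible_commands' is missing and 'score' is not an int.
    AgentInput(observation="You are in a kitchen.", score="high", done=False)
except ValidationError as e:
    print(e)  # reports both problems, pointing at the exact fields
```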
## The Inheritance Pattern: Strategy via Subclassing

With the base class enforcing the framework, implementing new agents becomes trivial. Here are the three agents I implemented:

### RandomAgent (Baseline)

```python
class RandomAgent(BaseAgent):
    """Selects randomly from valid actions. The simplest possible agent."""

    def act(self, agent_input: AgentInput) -> AgentOutput:
        action = random.choice(agent_input.admissible_commands)
        return AgentOutput(action=action)
```

### TransformersAgent (Local LLM)

```python
class TransformersAgent(BaseAgent):
    """Uses HuggingFace models for local inference."""

    def __init__(self, model_name: str, **kwargs):
        super().__init__(**kwargs)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Auto-detect model type
        if "t5" in model_name.lower() or "flan" in model_name.lower():
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        else:
            self.model = AutoModelForCausalLM.from_pretrained(model_name)

    def act(self, agent_input: AgentInput) -> AgentOutput:
        system_prompt, user_prompt, options = self._build_prompt(agent_input)
        # Tokenize, generate, decode, parse...
        return AgentOutput(action=parsed_action)
```
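The elided inference step is the usual HuggingFace recipe. As a rough sketch of the causal-LM path (not the repo's exact code; the prompt concatenation and `max_new_tokens` value are assumptions):

```python
# Hypothetical body for the elided step above (causal LM case):
prompt = f"{system_prompt}\n\n{user_prompt}"
inputs = self.tokenizer(prompt, return_tensors="pt")
output_ids = self.model.generate(**inputs, max_new_tokens=10)

# Causal LMs echo the prompt back, so decode only the newly generated tokens.
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
response = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
parsed_action = self._parse_action(response, options)
```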
### AgentOpenAI (OpenAI API LLM)

```python
class AgentOpenAI(BaseAgent):
    """Uses OpenAI's API for cloud-based inference."""

    def __init__(self, model: str = "gpt-4o-mini", **kwargs):
        super().__init__(**kwargs)
        self.client = OpenAI()
        self.model = model

    def act(self, agent_input: AgentInput) -> AgentOutput:
        system_prompt, user_prompt, options = self._build_prompt(agent_input)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.0
        )
        action = self._parse_action(response.choices[0].message.content, options)
        return AgentOutput(action=action)
```

The pattern is clear: inherit from `BaseAgent`, call `_build_prompt()` to get a standardized prompt, run your inference, and call `_parse_action()` to extract the command.

## The Factory Design Pattern: Configuration-Driven Agent Creation

Hardcoding agent types is fragile and doesn't scale as you add new agents. The ubiquitous Factory design pattern exists to solve exactly this:

```python
class AgentFactory:
    _registry: Dict[str, Type[BaseAgent]] = {}

    @classmethod
    def register(cls, name: str, agent_class: Type[BaseAgent]):
        cls._registry[name] = agent_class

    @classmethod
    def create(cls, name: str, **kwargs) -> BaseAgent:
        if name not in cls._registry:
            raise ValueError(f"Unknown agent: {name}")
        return cls._registry[name](**kwargs)

# Registration
AgentFactory.register("random", RandomAgent)
AgentFactory.register("transformers", TransformersAgent)
AgentFactory.register("openai", AgentOpenAI)
```

Agent selection happens through configuration:

```yaml
# config/agent.yaml
agent:
  type: openai  # Change this line to switch agents
  model: gpt-4o-mini
  history_length: 3
  objective_mode: explicit
```

```python
# main.py
agent_type = config["agent"]["type"]
agent = AgentFactory.create(agent_type, **agent_kwargs)
```

Adding agents is as easy as implementing a new agent class and registering it, as the sketch below shows. Note that keeping a registry of your subclasses isn't strictly necessary, but it is good bookkeeping.
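For example, a hypothetical `GreedyAgent` (not in the repo) that always takes the first admissible command slots in with a few lines:

```python
class GreedyAgent(BaseAgent):
    """Hypothetical example: always picks the first admissible command."""

    def act(self, agent_input: AgentInput) -> AgentOutput:
        return AgentOutput(action=agent_input.admissible_commands[0])

# After registering, `type: greedy` in config/agent.yaml selects it.
AgentFactory.register("greedy", GreedyAgent)
```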
## Prompt Engineering

I used the same prompt (implemented in the base class) for all the AI Agents. This was so I could make an apples-to-apples comparison; it's a personal choice, of course, and different prompts obviously work better on different LLMs:

```python
def _build_prompt(self, agent_input: AgentInput) -> tuple[str, str, Dict[str, str]]:
    # System prompt: establish the persona and rules
    system = f"""You are an expert text adventure game player.
Your objective: {agent_input.objective}

Rules:
- Choose ONLY from the numbered options below
- Respond with just the number of your choice
- Think step by step about which action advances your goal"""

    # User prompt: current state + history + options
    user = ""
    if self._history:
        user += "Recent history:\n"
        for obs, act in self._history[-self._history_length:]:
            user += f"- You saw: {obs[:100]}...\n- You did: {act}\n"

    user += f"\nCurrent situation:\n{agent_input.observation}\n"
    user += f"\nInventory: {agent_input.inventory}\n"
    user += "\nOptions:\n"

    options = {}
    for i, cmd in enumerate(agent_input.admissible_commands, 1):
        user += f"{i}. {cmd}\n"
        options[str(i)] = cmd

    return system, user, options
```

Key insights:

- **Numbered options**: LLMs are better at "choose 1, 2, or 3" than at generating exact command syntax
- **History context**: Including recent moves prevents loops and enables multi-step plans
- **Explicit rules**: Telling the model exactly what format to respond in reduces parsing failures
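For concreteness, on the kitchen example from earlier, the rendered user prompt would look roughly like this (illustrative, not captured from a real run):

```
Current situation:
You are in a small kitchen. There's a table with an apple on it.
A door leads north to the living room.

Inventory: You are carrying nothing.

Options:
1. take apple
2. go north
3. look
```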
## Action Parsing

LLMs don't always follow instructions. The parser handles this with a fallback chain. Be sure to use error handling like this, since you cannot rely on the exact format of an LLM's output:

```python
def _parse_action(self, response: str, options: Dict[str, str]) -> str:
    response = response.strip()

    # 1. Exact number match ("1" → first option)
    if response in options:
        return options[response]

    # 2. Exact command match (case-insensitive)
    for cmd in options.values():
        if response.lower() == cmd.lower():
            return cmd

    # 3. Substring match (response contains the command)
    for cmd in options.values():
        if cmd.lower() in response.lower():
            return cmd

    # 4. Word overlap (most words in common)
    response_words = set(response.lower().split())
    best_match, best_score = None, 0
    for cmd in options.values():
        cmd_words = set(cmd.lower().split())
        overlap = len(response_words & cmd_words)
        if overlap > best_score:
            best_match, best_score = cmd, overlap
    if best_match and best_score > 0:
        return best_match

    raise RuntimeError(f"Could not parse action from: {response}")
```

This handles everything from "1" to "I'll go north" to "take the shiny brass lamp" when the command is "take lamp."
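A few illustrative calls (reaching into the private method on an agent instance, just for demonstration) show each stage of the chain firing:

```python
options = {"1": "go north", "2": "take lamp"}

agent._parse_action("2", options)                          # "take lamp" (rule 1)
agent._parse_action("GO NORTH", options)                   # "go north"  (rule 2)
agent._parse_action("I'll go north now.", options)         # "go north"  (rule 3)
agent._parse_action("take the shiny brass lamp", options)  # "take lamp" (rule 4)
```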
## Results

With GPT-4o-mini and 3 steps of history, the agent solves simple TextWorld quests (finding and collecting objects), as expected. The random baseline, by contrast, does poorly, also as expected. More importantly, the framework makes experimentation easy:

- Swap `objective_mode: explicit` for `objective_mode: abstract` to see how agents handle vague goals
- Change `history_length` to test memory requirements
- Compare local models vs. cloud APIs
- Collect trajectory data for fine-tuning later

## Some Takeaways

A few notes I'd like to share. These are pretty obvious, but still worth writing down, since they pop up in some form or another whenever I build new AI Agents:

**1. Type safety matters.** Preemptively catch bugs and verify your LLM's outputs.

**2. LLMs are capable reasoners.** With good prompts and constrained action spaces, off-the-shelf models can play games competently.

**3. Design for experimentation.** YAML configs, factory patterns, and clean interfaces mean you can iterate quickly. In general, always try to move fast.