# Hello World for AI Agents: Building Game-Playing Agents

This is a Hello World example for building AI Agents. The point of the [TextWorld Agent Repository](https://github.com/DoKu88/textWorldAgent) is to show how to get started building AI Agents and the simple components they're made of. The setup is very similar to the reinforcement learning environments from [OpenAI Gym](https://gymnasium.farama.org/), where a standardized environment interface begets a standardized agent interface to interact with it. I did not do any training for this project, since ChatGPT worked fantastically out of the box, and I have decided to save training for a harder AI Agent task.

## The Problem: Text Adventure Games

Text adventure games present the player with a description of their environment and accept natural language commands:

```
You are in a small kitchen. There's a table with an apple on it.
A door leads north to the living room.

> take apple
You pick up the apple.

> go north
You enter the living room...
```

These games used to be surprisingly hard for AI. They require:

- Understanding natural language descriptions
- Maintaining memory of what you've seen and done
- Planning multi-step sequences toward goals
- Selecting valid actions from a constrained set

I used [TextWorld](https://github.com/microsoft/TextWorld), Microsoft's framework for procedurally generated text games, as the environment. The question was: how do you build an agent architecture that's clean, extensible, and actually works?

## The Core Insight: The Base Class Is the Framework

The heart of this project is a single abstract class: `BaseAgent`. It handles everything common to all agents:

```python
class BaseAgent(ABC):
    """Abstract base agent for TextWorld using Pydantic I/O."""

    def __init__(self, history_length: int = 0, objective_mode: str = "explicit"):
        self._history: List[tuple[str, str]] = []
        self._history_length = history_length
        self._objective_mode = objective_mode

    @abstractmethod
    def act(self, agent_input: AgentInput) -> AgentOutput:
        """Subclasses implement this to select an action."""
        pass

    def __call__(self, observation, score, done, info) -> str:
        """Convenience callable interface matching TextWorld's expected signature."""
        agent_input = AgentInput.from_textworld(
            observation, score, done, info, self._objective_mode
        )
        output = self.act(agent_input)
        self._record_history(observation, output.action)
        return output.action
```

This design means:

- **One method to implement**: New agents only need to define `act()`
- **Shared infrastructure**: History tracking, prompt building, and action parsing are all handled
- **Consistent interface**: Every agent looks the same to the runner

The base class provides sensible defaults for the hard parts: building the prompt, cleaning observations (removing ASCII art artifacts), and parsing LLM responses into valid commands. These are solved problems that all of our agents need.
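To make the runner-facing interface concrete, here is a minimal driver loop sketched against TextWorld's standard gym wrapper. The game file path and the exact set of requested infos are illustrative assumptions, not taken from the repo:

```python
import gym
import textworld.gym
from textworld import EnvInfos

# Ask TextWorld for the extra state the agent consumes.
infos = EnvInfos(admissible_commands=True, inventory=True,
                 objective=True, max_score=True)

# "games/simple_quest.ulx" is a placeholder path.
env_id = textworld.gym.register_game("games/simple_quest.ulx", request_infos=infos)
env = gym.make(env_id)

agent = RandomAgent()  # any BaseAgent subclass works here

obs, info = env.reset()
score, done = 0, False
while not done:
    command = agent(obs, score, done, info)  # BaseAgent.__call__
    obs, score, done, info = env.step(command)
print(f"Final score: {score}")
```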
## Type-Safe I/O with Pydantic

One note: using type-safe inputs and outputs for your agent makes writing and debugging code far more predictable. Without guardrails, you may end up with `KeyError` exceptions and silent failures. Pydantic models fix this:

```python
from typing import List, Optional
from pydantic import BaseModel

class AgentInput(BaseModel):
    observation: str
    score: int
    done: bool
    admissible_commands: List[str]
    inventory: Optional[str] = None
    max_score: Optional[int] = None
    objective: Optional[str] = None

    @classmethod
    def from_textworld(cls, observation, score, done, info, objective_mode):
        """Factory method that handles all the TextWorld-specific weirdness."""
        # Extract and validate fields...
        return cls(...)

class AgentOutput(BaseModel):
    action: str
    reasoning: Optional[str] = None
    confidence: Optional[float] = None
```

Every agent receives an `AgentInput` and returns an `AgentOutput`.
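The payoff is that malformed data fails loudly at construction time rather than as a `KeyError` three calls later. A small illustration, assuming the Pydantic models above:

```python
from pydantic import ValidationError

try:
    # 'admissible_commands' is missing and 'score' is not an int.
    AgentInput(observation="You are in a kitchen.", score="high", done=False)
except ValidationError as e:
    print(e)  # reports both problems, pointing at the exact fields
```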
## The Inheritance Pattern: Strategy via Subclassing

With the base class enforcing the framework, implementing new agents becomes trivial. Here are the three agents I implemented:

### RandomAgent (Baseline)

```python
class RandomAgent(BaseAgent):
    """Selects randomly from valid actions. The simplest possible agent."""

    def act(self, agent_input: AgentInput) -> AgentOutput:
        action = random.choice(agent_input.admissible_commands)
        return AgentOutput(action=action)
```

### TransformersAgent (Local LLM)

```python
class TransformersAgent(BaseAgent):
    """Uses HuggingFace models for local inference."""

    def __init__(self, model_name: str, **kwargs):
        super().__init__(**kwargs)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Auto-detect model type
        if "t5" in model_name.lower() or "flan" in model_name.lower():
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        else:
            self.model = AutoModelForCausalLM.from_pretrained(model_name)

    def act(self, agent_input: AgentInput) -> AgentOutput:
        system_prompt, user_prompt, options = self._build_prompt(agent_input)
        # Tokenize, generate, decode, parse...
        return AgentOutput(action=parsed_action)
```
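The elided inference step is the usual HuggingFace recipe. As a rough sketch of the causal-LM path (not the repo's exact code; the prompt concatenation and `max_new_tokens` value are assumptions):

```python
# Hypothetical body for the elided step above (causal LM case):
prompt = f"{system_prompt}\n\n{user_prompt}"
inputs = self.tokenizer(prompt, return_tensors="pt")
output_ids = self.model.generate(**inputs, max_new_tokens=10)

# Causal LMs echo the prompt back, so decode only the newly generated tokens.
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
response = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
parsed_action = self._parse_action(response, options)
```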
### AgentOpenAI (OpenAI API LLM)

```python
class AgentOpenAI(BaseAgent):
    """Uses OpenAI's API for cloud-based inference."""

    def __init__(self, model: str = "gpt-4o-mini", **kwargs):
        super().__init__(**kwargs)
        self.client = OpenAI()
        self.model = model

    def act(self, agent_input: AgentInput) -> AgentOutput:
        system_prompt, user_prompt, options = self._build_prompt(agent_input)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.0
        )
        action = self._parse_action(response.choices[0].message.content, options)
        return AgentOutput(action=action)
```

The pattern is clear: inherit from `BaseAgent`, call `_build_prompt()` to get a standardized prompt, run your inference, and call `_parse_action()` to extract the command.

## The Factory Design Pattern: Configuration-Driven Agent Creation

Hardcoding agent types is fragile and doesn't scale as you add new agents. The ubiquitous Factory design pattern exists to solve exactly this:

```python
class AgentFactory:
    _registry: Dict[str, Type[BaseAgent]] = {}

    @classmethod
    def register(cls, name: str, agent_class: Type[BaseAgent]):
        cls._registry[name] = agent_class

    @classmethod
    def create(cls, name: str, **kwargs) -> BaseAgent:
        if name not in cls._registry:
            raise ValueError(f"Unknown agent: {name}")
        return cls._registry[name](**kwargs)

# Registration
AgentFactory.register("random", RandomAgent)
AgentFactory.register("transformers", TransformersAgent)
AgentFactory.register("openai", AgentOpenAI)
```

Agent selection happens through configuration:

```yaml
# config/agent.yaml
agent:
  type: openai  # Change this line to switch agents
  model: gpt-4o-mini
  history_length: 3
  objective_mode: explicit
```

```python
# main.py
agent_type = config["agent"]["type"]
agent = AgentFactory.create(agent_type, **agent_kwargs)
```

Adding agents is as easy as implementing a new agent class and registering it, as the sketch below shows. Note that keeping a registry of your subclasses isn't strictly necessary, but it is good bookkeeping.
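For example, a hypothetical `GreedyAgent` (not in the repo) that always takes the first admissible command slots in with a few lines:

```python
class GreedyAgent(BaseAgent):
    """Hypothetical example: always picks the first admissible command."""

    def act(self, agent_input: AgentInput) -> AgentOutput:
        return AgentOutput(action=agent_input.admissible_commands[0])

# After registering, `type: greedy` in config/agent.yaml selects it.
AgentFactory.register("greedy", GreedyAgent)
```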
## Prompt Engineering

I used the same prompt (implemented in the base class) for all the AI Agents. This was so I could make an apples-to-apples comparison; it's a personal choice, of course, and different prompts obviously work better on different LLMs:

```python
def _build_prompt(self, agent_input: AgentInput) -> tuple[str, str, Dict[str, str]]:
    # System prompt: establish the persona and rules
    system = f"""You are an expert text adventure game player.
Your objective: {agent_input.objective}

Rules:
- Choose ONLY from the numbered options below
- Respond with just the number of your choice
- Think step by step about which action advances your goal"""

    # User prompt: current state + history + options
    user = ""
    if self._history:
        user += "Recent history:\n"
        for obs, act in self._history[-self._history_length:]:
            user += f"- You saw: {obs[:100]}...\n- You did: {act}\n"

    user += f"\nCurrent situation:\n{agent_input.observation}\n"
    user += f"\nInventory: {agent_input.inventory}\n"
    user += "\nOptions:\n"

    options = {}
    for i, cmd in enumerate(agent_input.admissible_commands, 1):
        user += f"{i}. {cmd}\n"
        options[str(i)] = cmd

    return system, user, options
```

Key insights:

- **Numbered options**: LLMs are better at "choose 1, 2, or 3" than at generating exact command syntax
- **History context**: Including recent moves prevents loops and enables multi-step plans
- **Explicit rules**: Telling the model exactly what format to respond in reduces parsing failures
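For concreteness, on the kitchen example from earlier, the rendered user prompt would look roughly like this (illustrative, not captured from a real run):

```
Current situation:
You are in a small kitchen. There's a table with an apple on it.
A door leads north to the living room.

Inventory: You are carrying nothing.

Options:
1. take apple
2. go north
3. look
```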
## Action Parsing

LLMs don't always follow instructions. The parser handles this with a fallback chain. Be sure to use error handling like this, since you cannot rely on the exact format of an LLM's output:

```python
def _parse_action(self, response: str, options: Dict[str, str]) -> str:
    response = response.strip()

    # 1. Exact number match ("1" → first option)
    if response in options:
        return options[response]

    # 2. Exact command match (case-insensitive)
    for cmd in options.values():
        if response.lower() == cmd.lower():
            return cmd

    # 3. Substring match (response contains the command)
    for cmd in options.values():
        if cmd.lower() in response.lower():
            return cmd

    # 4. Word overlap (most words in common)
    response_words = set(response.lower().split())
    best_match, best_score = None, 0
    for cmd in options.values():
        cmd_words = set(cmd.lower().split())
        overlap = len(response_words & cmd_words)
        if overlap > best_score:
            best_match, best_score = cmd, overlap
    if best_match and best_score > 0:
        return best_match

    raise RuntimeError(f"Could not parse action from: {response}")
```

This handles everything from "1" to "I'll go north" to "take the shiny brass lamp" when the command is "take lamp."
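A few illustrative calls (reaching into the private method on an agent instance, just for demonstration) show each stage of the chain firing:

```python
options = {"1": "go north", "2": "take lamp"}

agent._parse_action("2", options)                          # "take lamp" (rule 1)
agent._parse_action("GO NORTH", options)                   # "go north"  (rule 2)
agent._parse_action("I'll go north now.", options)         # "go north"  (rule 3)
agent._parse_action("take the shiny brass lamp", options)  # "take lamp" (rule 4)
```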
## Results

With GPT-4o-mini and 3 steps of history, the agent solves simple TextWorld quests (finding and collecting objects), as expected. The random baseline, by contrast, does poorly, also as expected. More importantly, the framework makes experimentation easy:

- Swap `objective_mode: explicit` for `objective_mode: abstract` to see how agents handle vague goals
- Change `history_length` to test memory requirements
- Compare local models vs. cloud APIs
- Collect trajectory data for fine-tuning later

## Some Takeaways

A few notes I'd like to share. These are pretty obvious, but still worth writing down, since they pop up in some form or another whenever I build new AI Agents:

**1. Type safety matters.** Preemptively catch bugs and verify your LLM's outputs.

**2. LLMs are capable reasoners.** With good prompts and constrained action spaces, off-the-shelf models can play games competently.

**3. Design for experimentation.** YAML configs, factory patterns, and clean interfaces mean you can iterate quickly. In general, always try to move fast.