EVOTEST is an evolutionary test-time learning framework that improves an agent across episodes without gradients or fine-tuning. It evolves the entire agentic system between attempts by rewriting prompts, updating cross-episode memory, tuning hyperparameters, and refining tool-use routines.
This repository also provides J-TTL (Jericho Test-Time Learning), a benchmark setting where an agent plays the same text adventure game for multiple consecutive episodes and must improve using only within-session experience.
- Title: EVOTEST: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
- Authors: Yufei He, Juncheng Liu, Yue Liu, Yibo Li, Tri Cao, Zhiyuan Hu, Xinxing Xu, Bryan Hooi
- Affiliations: National University of Singapore, Microsoft Research
- Benchmark (J-TTL): Measures on-the-fly learning across repeated episodes of the same Jericho game.
- Method (EVOTEST): Evolves prompts, code-based state extractors, cross-episode memory, and hyperparameters after each episode—no training required.
- Results: Consistent improvements across games, outperforming reflection-, memory-, and gradient-based online methods; uniquely achieves wins on Detective and Library in our evaluations.
- main.py: Entry point for running agents and evaluations.
- src/:
  - evaluation.py: Evaluation loop over episodes; logging and metrics.
  - env.py: Jericho environment wrapper using FrotzEnv.
  - our_agent.py: EVOTEST Actor/Evolver agent with cross-episode memory and evolutionary prompt/code updates.
  - summary_agent.py: Agent variant using LLM-generated summaries.
  - memory_agent.py, rag_agent.py, naive_agent.py: Baseline agents (memory-only, RAG-enhanced, and naive).
  - openai_helpers.py: OpenRouter/OpenAI client with retry and token utilities.
  - utils.py: Small helpers (e.g., ROM file resolution).
- jericho-games/: Game ROMs directory (e.g., zork1.z5, detective.z5).
- *.qzl: Pre-bundled game files for convenience in some setups.
- test_*.py: Pytest-based checks for the naive, RAG, and RAG embeddings agents.
- requirements_rag.txt: Minimal dependencies for the RAG agent; see the install notes below for the full setup.
- Python 3.10+ recommended.
- Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate # on macOS/Linux
# .venv\Scripts\activate # on Windows PowerShell
Install the minimal requirements, then Jericho and the supporting libraries used by the agents:
pip install -r requirements_rag.txt
pip install jericho tiktoken python-dotenv
If you plan to use RAG with larger embedding backends or vector indices:
# optional extras
pip install sentence-transformers faiss-cpu
Note: On Apple Silicon, faiss-cpu wheels may vary; if issues arise, skip FAISS or install it via conda-forge.
Create a .env file in the project root with your OpenRouter key (get it from https://openrouter.ai/settings/keys):
echo 'OPENAI_API_KEY="sk-..."' > .env
All LLM calls route through OpenRouter by default using openai_helpers.py (the OpenAI SDK with a custom base URL). Some OpenAI models on OpenRouter may require additional consent on the OpenRouter dashboard.
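Before starting a long run, it can help to confirm that the key in .env is readable. The following is a minimal, standard-library sketch; the helper name `read_env_value` is hypothetical and not part of this repository, and it handles only simple KEY=VALUE lines such as the one written by the echo command above.

```python
from pathlib import Path
from typing import Optional


def read_env_value(env_path: str, key: str) -> Optional[str]:
    """Return the value assigned to `key` in a simple KEY=VALUE file, or None.

    Hypothetical helper for illustration; the repository itself relies on
    python-dotenv rather than this function.
    """
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding single or double quotes.
            return value.strip().strip("\"'")
    return None
```

Calling `read_env_value(".env", "OPENAI_API_KEY")` from the project root returns the key string, or `None` if the file or the entry is missing.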
Jericho requires valid Z-machine game files (e.g., .z5). This repo includes a jericho-games/ folder. Ensure the game ROM you want to evaluate exists there. You can also point --rom_path to a custom directory.
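To check that a ROM is present before launching an evaluation, a small lookup like the one below can be used. This is only a sketch: `find_rom` is a hypothetical name, and it assumes the ROM file stem matches the value passed to --game_name (as with detective.z5). The repository's own resolution logic in utils.py may differ.

```python
from pathlib import Path
from typing import Optional


def find_rom(rom_dir: str, game_name: str) -> Optional[Path]:
    """Return the first Z-machine file in rom_dir whose stem equals game_name.

    Hypothetical helper; assumes file names look like <game_name>.z5.
    """
    directory = Path(rom_dir)
    if not directory.is_dir():
        return None
    for candidate in sorted(directory.iterdir()):
        suffix = candidate.suffix.lower()
        # Z-machine story files conventionally use the extensions .z1 to .z8.
        is_zfile = len(suffix) == 3 and suffix[1] == "z" and suffix[2] in "12345678"
        if candidate.is_file() and is_zfile and candidate.stem == game_name:
            return candidate
    return None
```

For example, `find_rom("jericho-games/", "detective")` returns the path to the ROM if it exists and `None` otherwise.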
Run EVOTEST on the Detective game for 10 episodes using a fast model:
python main.py \
--game_name detective \
--rom_path jericho-games/ \
--agent_type our \
--llm_model google/gemini-2.5-flash \
--eval_runs 10 \
--env_step_limit 110 \
--llm_temperature 0.4 \
--evol_temperature 0.7
Change --llm_model to any OpenRouter-supported model (e.g., openai/gpt-4o-mini, anthropic/claude-4-sonnet-20250522).
Episodes, summaries, and metrics are written to:
output/<game>/<agent_type>/<model>/<timestamp>/
Each episode log includes step-by-step observations, chosen actions, rewards, and cumulative scores.
- our: EVOTEST with evolutionary updates (prompt and code state extractor), optional cross-episode memory, UCB-based node selection, and auto-freeze on wins.
- memory: Memory-only baseline with recent context window.
- summary: Uses an LLM to summarize progress and feed it into action selection.
- rag: Retrieves similar prior states/actions (cross-episode positives) to guide decisions.
- naive: Minimal baseline issuing generic exploratory actions.
Select with --agent_type {our,memory,summary,rag,naive}.
- Whole-system evolution between episodes: After each run, an Evolver LLM proposes a revised guiding prompt and a Python state extractor (extract_state(game_history)) specialized to the current game.
- Cross-episode memory (optional): Stores successful (state → action, +Δscore) and negative loop patterns to encourage progress and avoid plateaus.
- UCB-driven exploration vs. exploitation: Chooses which evolved node to try; can auto-freeze on detected wins and resume evolution if a frozen prompt later fails.
- Tooling: Explicit control over temperatures for acting, summarization, RAG, and evolution.
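To make the state extractor mentioned above more concrete, here is an illustrative sketch of what an evolved extract_state(game_history) function might look like. The real extractors are generated by the Evolver LLM and are specific to each game. The history format (a list of (action, observation) pairs) and the returned fields are assumptions made for this example only.

```python
from collections import Counter
from typing import Dict, List, Tuple


def extract_state(game_history: List[Tuple[str, str]]) -> Dict[str, object]:
    """Summarize a history of (action, observation) pairs into a compact state.

    ASSUMPTION: the history format and the output keys are hypothetical and
    chosen for illustration; they are not taken from the repository.
    """
    actions = [action for action, _ in game_history]
    last_observation = game_history[-1][1] if game_history else ""
    action_counts = Counter(actions)
    return {
        "steps_taken": len(game_history),
        "last_observation": last_observation.strip(),
        # Actions issued three or more times hint that the agent may be looping.
        "repeated_actions": sorted(a for a, n in action_counts.items() if n >= 3),
    }
```

A summary like this could be placed in the Actor's prompt instead of the full transcript, which keeps the context short while still flagging likely loops.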
Relevant flags in main.py:
- --evolution_llm_model: Model for the Evolver (default openai/o3-2025-04-16).
- --freeze_on_win, --win_freeze_threshold, --force_best_after_drop, --drop_threshold.
- --enable_cross_mem: Toggle cross-episode memory and negative-contrast evolution.
- --exploration_constant, --depth_constant: UCB and depth decay controls.
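The exact selection rule lives in our_agent.py and is not reproduced here. As a rough guide to how an exploration constant and a depth constant can interact, the sketch below uses the standard UCB1 score and, as an explicit assumption, shrinks the exploration bonus geometrically with a node's depth. The `Node` class, the function names, and the default values are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    depth: int               # evolution steps from the root configuration
    visits: int = 0          # episodes played with this configuration
    total_score: float = 0.0


def ucb_score(node: Node, total_visits: int,
              exploration_constant: float, depth_constant: float) -> float:
    """UCB1 score with a depth-decayed exploration bonus (illustrative only)."""
    if node.visits == 0:
        return math.inf  # always try an untested node first
    mean = node.total_score / node.visits
    bonus = exploration_constant * math.sqrt(math.log(total_visits) / node.visits)
    # ASSUMPTION: deeper nodes receive a geometrically smaller bonus.
    return mean + (depth_constant ** node.depth) * bonus


def select_node(nodes: List[Node], exploration_constant: float = 1.0,
                depth_constant: float = 0.8) -> Node:
    """Pick the node with the highest score; defaults are placeholders."""
    total_visits = sum(n.visits for n in nodes)
    return max(nodes, key=lambda n: ucb_score(
        n, total_visits, exploration_constant, depth_constant))
```

In this sketch, a larger exploration constant favors rarely tried nodes, while a depth constant below 1 biases selection toward shallower, better-tested configurations. Scores are assumed to be on a scale comparable to the bonus term.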
Example (50 episodes for statistics, EVOTEST on Detective):
python main.py \
--game_name detective \
--rom_path jericho-games/ \
--agent_type our \
--eval_runs 50
Switch games (e.g., library, zork1) and agents via --game_name and --agent_type. Use a consistent --seed for reproducibility.
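When comparing several games and agents, it may be convenient to generate the command lines programmatically so that every run shares the same seed. The sketch below only builds argument lists from the flags shown in this README. The function names are hypothetical, and it assumes --seed takes an integer value.

```python
from typing import Iterable, List


def build_command(game_name: str, agent_type: str, eval_runs: int, seed: int,
                  rom_path: str = "jericho-games/") -> List[str]:
    """Build one main.py invocation as an argument list (nothing is executed)."""
    return [
        "python", "main.py",
        "--game_name", game_name,
        "--rom_path", rom_path,
        "--agent_type", agent_type,
        "--eval_runs", str(eval_runs),
        # ASSUMPTION: --seed accepts an integer.
        "--seed", str(seed),
    ]


def build_sweep(games: Iterable[str], agents: Iterable[str],
                eval_runs: int, seed: int) -> List[List[str]]:
    """Return one command per (game, agent) pair, all sharing the same seed."""
    agent_list = list(agents)
    return [build_command(game, agent, eval_runs, seed)
            for game in games for agent in agent_list]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the project root, or printed with `shlex.join(cmd)` to inspect it first.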
Run the included tests (naive, RAG, embeddings checks):
pytest -q
- "ModuleNotFoundError: jericho": pip install jericho (or install via conda if needed).
- OpenRouter errors/rate limits: ensure .env has OPENAI_API_KEY, and the selected --llm_model is enabled on your OpenRouter account.
- FAISS install on macOS: prefer faiss-cpu via conda-forge, or skip FAISS if not using large RAG indices.
- No score improvements: try increasing --eval_runs, enabling --enable_cross_mem, or using a stronger --evolution_llm_model.
If you find this repository useful, please cite the paper:
@article{he2025evotest,
title={EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems},
author={He, Yufei and Liu, Juncheng and Liu, Yue and Li, Yibo and Cao, Tri and Hu, Zhiyuan and Xu, Xinxing and Hooi, Bryan},
journal={arXiv preprint arXiv:2510.13220},
year={2025}
}
This project is licensed under the terms of the LICENSE file in the repository.
- Jericho: the interactive fiction environment used in J-TTL.
- OpenRouter: unified interface for accessing multiple foundation models.