EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

EvoArena evaluates LLM agents in environments where terminal workflows, software repositories, and user preferences evolve over time. EvoMem augments agent memory with patch histories that preserve what changed, why it changed, and when old behavior still matters.

Jundong Xu1,* Qingchuan Li1,* Jiaying Wu1 Yihuai Lan2 Shuyue Stella Li3 Huichi Zhou4 Bowen Jiang5 Lei Wang3 Jun Wang4 Anh Tuan Luu6 Caiming Xiong7 Hae Won Park8 Bryan Hooi1 Zhiyuan Hu1,8
1National University of Singapore 2Singapore Management University 3University of Washington 4University College London 5University of Pennsylvania 6Nanyang Technological University 7Recursive 8Massachusetts Institute of Technology

*Equal contribution.

Overall model performance on EvoArena

Current agents struggle under persistent environment evolution, achieving 39.6% average accuracy across terminal, software, and social-preference domains.

Persistent Environment Evolution

Three Domains Where the Environment Keeps Moving

EvoArena covers executable workflow evolution, software evolution, and preference evolution. The shared challenge is version-aware reliability: agents must adapt to new conditions while preserving behavior that remains valid.

EvoArena benchmark overview

Terminal-Bench-Evo

Executable Workflow Evolution

Terminal tasks become discrete version chains. The end goal may stay fixed, but later releases change deployment mechanisms, served paths, permissions, branch policies, dependencies, or tests.

Terminal-Bench-Evo example
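
A version chain can be pictured as an ordered list of releases that share a goal but differ in operational details. The sketch below is illustrative only: the `TaskVersion` class, its fields, the `changed_settings` helper, and all example values are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass, field


@dataclass
class TaskVersion:
    """One release in a hypothetical task version chain."""
    version: int
    goal: str
    settings: dict = field(default_factory=dict)


def changed_settings(prev: TaskVersion, curr: TaskVersion) -> dict:
    """Return {key: (old, new)} for every setting that differs between two releases."""
    keys = sorted(set(prev.settings) | set(curr.settings))
    return {
        k: (prev.settings.get(k), curr.settings.get(k))
        for k in keys
        if prev.settings.get(k) != curr.settings.get(k)
    }


# The goal stays fixed while deployment details drift from release to release.
chain = [
    TaskVersion(1, "serve the site", {"deploy": "git push", "served_path": "/"}),
    TaskVersion(2, "serve the site", {"deploy": "git push", "served_path": "/app"}),
    TaskVersion(3, "serve the site",
                {"deploy": "rsync", "served_path": "/app", "branch": "release"}),
]

for prev, curr in zip(chain, chain[1:]):
    print(curr.version, changed_settings(prev, curr))
```

An agent that only remembers the version 1 settings would still aim at the right goal but issue the wrong commands by version 3, which is the failure mode this domain is built to expose.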

SWE-Chain-Evo

Software Evolution

Repository milestones are evaluated chronologically. Each new requirement is solved on top of the accumulated codebase state, with Pass-to-Pass tests checking that prior behavior still holds.

SWE-Chain-Evo example
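
The chronological protocol can be sketched as a loop that applies one milestone at a time to a shared state and re-runs earlier tests after each step. This is a toy model under stated assumptions: the "codebase" is a dict of functions, every test from an earlier milestone is treated as a Pass-to-Pass test, and `evaluate_chain` and the milestone functions are hypothetical names rather than part of the benchmark harness.

```python
from typing import Callable, Dict, List

Codebase = Dict[str, Callable]
Test = Callable[[Codebase], bool]


def evaluate_chain(patches: List[Callable[[Codebase], None]],
                   new_tests: List[List[Test]]) -> List[dict]:
    """Apply patches in order; after each one, run its new tests and all earlier tests."""
    codebase: Codebase = {}
    earlier_tests: List[Test] = []
    report = []
    for step, (patch, tests) in enumerate(zip(patches, new_tests), start=1):
        patch(codebase)  # state accumulates across milestones
        report.append({
            "step": step,
            "new_ok": all(t(codebase) for t in tests),
            "p2p_ok": all(t(codebase) for t in earlier_tests),
        })
        earlier_tests.extend(tests)
    return report


def milestone_1(cb: Codebase) -> None:
    cb["add"] = lambda a, b: a + b


def milestone_2(cb: Codebase) -> None:
    cb["neg"] = lambda a: -a
    cb["add"] = lambda a, b: a - b  # accidental regression of milestone 1


report = evaluate_chain(
    [milestone_1, milestone_2],
    [[lambda cb: cb["add"](2, 3) == 5], [lambda cb: cb["neg"](4) == -4]],
)
print(report)
```

In this toy run the second milestone satisfies its own requirement yet fails the earlier test, so it would count as a regression even though the new feature works.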

PersonaMem-Evo

Social Intelligence Evolution

Long conversation histories contain implicit and evolving preference evidence. Agents must answer questions from the current preference state while distinguishing it from outdated but historically grounded evidence.

PersonaMem-Evo example
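
The distinction between current and outdated evidence can be illustrated with a small lookup over timestamped observations. The sketch assumes that preference evidence has already been extracted into explicit records and that the most recent record wins; both are simplifications, since the benchmark describes the evidence as implicit. The `Evidence` class and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Evidence:
    turn: int    # position in the conversation history
    topic: str
    value: str


def current_preference(history: List[Evidence], topic: str) -> Optional[Evidence]:
    """Latest evidence for a topic, or None if the topic never came up."""
    relevant = [e for e in history if e.topic == topic]
    return max(relevant, key=lambda e: e.turn, default=None)


def superseded(history: List[Evidence], topic: str) -> List[Evidence]:
    """Earlier evidence for the topic, oldest first: historically true, no longer current."""
    latest = current_preference(history, topic)
    older = [e for e in history if e.topic == topic and e is not latest]
    return sorted(older, key=lambda e: e.turn)


history = [
    Evidence(3, "coffee", "espresso"),
    Evidence(17, "music", "jazz"),
    Evidence(41, "coffee", "decaf"),
]

print(current_preference(history, "coffee").value)
print([e.value for e in superseded(history, "coffee")])
```

A question about what the user drinks now should be answered from the latest record, while a question about what they used to drink needs the superseded one, so neither can simply be discarded.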

Dataset Composition

Domain and Question-Type Distribution

EvoArena combines complementary forms of evolution. The central chart shows the domain mixture, while surrounding panels break down the question or update types within each domain.

Step accuracy: Accuracy averaged over individual evolved task instances.
Chain accuracy: A stricter metric requiring every step in an evolution chain to be solved.
Distribution of EvoArena domains and question types
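
The two metrics can be made concrete with a few lines of code. This is a minimal sketch that assumes step accuracy pools all steps across chains before averaging; the page does not state whether the official metric pools steps or averages per chain, and the function names and example results are hypothetical.

```python
from typing import Dict, List


def step_accuracy(chains: Dict[str, List[bool]]) -> float:
    """Fraction of individual steps solved, pooled over all chains."""
    steps = [ok for results in chains.values() for ok in results]
    return sum(steps) / len(steps)


def chain_accuracy(chains: Dict[str, List[bool]]) -> float:
    """Fraction of chains in which every step was solved."""
    return sum(all(results) for results in chains.values()) / len(chains)


results = {
    "task-a": [True, True, True],
    "task-b": [True, False, True],
    "task-c": [True, True, False, True],
}

print(step_accuracy(results))   # 8 of 10 steps solved
print(chain_accuracy(results))  # 1 of 3 chains fully solved
```

A single failed step removes the whole chain from the chain-level count, which is why chain accuracy falls well below step accuracy in this example.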

EvoMem

Patch-Based Memory Evolution

EvoMem keeps the base memory system intact, but adds an append-only patch history for meaningful memory changes. At inference time, agents retrieve both the latest memory and relevant patches.

Overview of EvoMem patch memory architecture
01 Record Changes

Capture non-additive updates, including previous state, new state, rationale, summary, and evidence.

02 Retrieve Versioned Evidence

Return relevant historical patches alongside current memory when queries depend on evolution.

03 Preserve Agent Interfaces

Instantiate the same abstraction for Terminus2, OpenHands, A-Mem, and Memento-Skill.
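
The three steps above can be sketched as a thin wrapper around a key-value memory. This is not the EvoMem implementation: the `Patch` and `PatchedMemory` classes are hypothetical, the patch fields simply mirror the list under "Record Changes", and exact-key lookup stands in for whatever relevance retrieval the real system uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Patch:
    """One recorded change; the field layout is an assumption."""
    key: str
    previous_state: str
    new_state: str
    rationale: str
    summary: str
    evidence: str


class PatchedMemory:
    """Hypothetical wrapper: a latest-state store plus an append-only patch log."""

    def __init__(self) -> None:
        self._latest: Dict[str, str] = {}
        self._patches: List[Patch] = []

    def write(self, key: str, value: str, rationale: str = "", evidence: str = "") -> None:
        previous = self._latest.get(key)
        if previous is not None and previous != value:
            # Non-additive update: log the transition before the old state is overwritten.
            self._patches.append(Patch(
                key=key, previous_state=previous, new_state=value,
                rationale=rationale, summary=f"{key}: {previous} -> {value}",
                evidence=evidence,
            ))
        self._latest[key] = value

    def retrieve(self, key: str) -> Tuple[Optional[str], List[Patch]]:
        """Return the current value together with every patch recorded for the key."""
        return self._latest.get(key), [p for p in self._patches if p.key == key]


memory = PatchedMemory()
memory.write("deploy", "git push to main")
memory.write("deploy", "rsync to /srv/app", rationale="branch policy changed",
             evidence="release notes")
memory.write("editor", "vim")  # additive write: no patch is recorded

latest, history = memory.retrieve("deploy")
print(latest)
print([p.summary for p in history])
```

Because callers still read and write through the same two methods, the base memory is unchanged and the patch log is purely additional context, which is the property the third step relies on.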

Results

Main Results

Table 3 reports EvoArena performance with step accuracy and chain accuracy. Table 4 shows that EvoMem also improves typical agent and long-horizon memory benchmarks.

Main Results on EvoArena

Benchmark           Agent       Model            Step Base  Step +EvoMem  Step Δ  Chain Base  Chain +EvoMem  Chain Δ
Terminal-Bench-Evo  Terminus 2  GPT-5.5               62.8          65.1    +2.3        31.8           45.5    +13.7
                                Gemini-3.1-Pro        53.8          56.5    +2.7        39.3           44.1     +4.8
                                Kimi-K2.6             40.8          42.9    +2.1        14.9           22.7     +7.8
                                Deepseek-V4-Pro       37.3          40.4    +3.1        13.5           22.4     +8.9
                                GLM-5.1               51.8          55.3    +3.5        34.2           36.8     +2.6
                                MiniMax-M2.7          41.0          42.4    +1.4        18.2           19.5     +1.3
                                Qwen3.6-27B           37.6          40.9    +3.3        11.1           17.3     +6.2
                                Gemma4-31B            23.4          24.5    +1.1         9.0           12.4     +3.4
                                Average               43.6          46.0    +2.4        21.5           27.6     +6.1
SWE-Chain-Evo       OpenHands   GPT-5.5               49.7          50.9    +1.2        12.2           16.8     +4.6
                                Gemini-3.1-Pro        20.5          18.1    -2.4         8.8           10.2     +1.4
                                Kimi-K2.6             30.2          27.6    -2.6         8.5           12.1     +3.6
                                Deepseek-V4-Pro       26.7          27.7    +1.0         8.2           13.3     +5.1
                                GLM-5.1               34.9          36.1    +1.2         9.9           12.8     +2.9
                                MiniMax-M2.7          41.4          42.3    +0.9        14.7           15.3     +0.6
                                Qwen3.6-27B           11.6          11.6    +0.0        12.2           10.1     -2.1
                                Gemma4-31B             8.5          12.0    +3.5         5.2            6.3     +1.1
                                Average               27.9          28.3    +0.4        10.0           12.1     +2.1
PersonaMem-Evo      A-Mem       GPT-5.5               40.0          43.8    +3.8        37.5           41.2     +3.7
                                Gemini-3.1-Pro        46.4          48.3    +1.9        38.8           40.8     +2.0
                                Kimi-K2.6             51.5          55.5    +4.0        40.2           50.0     +9.8
                                Deepseek-V4-Pro       47.9          51.6    +3.7        40.4           47.4     +7.0
                                GLM-5.1               50.4          47.5    -2.9        42.5           38.9     -3.7
                                MiniMax-M2.7          47.5          47.9    +0.4        40.9           41.4     +0.4
                                Qwen3.6-27B           43.5          44.4    +1.0        36.5           39.9     +3.4
                                Gemma4-31B            51.1          52.9    +1.8        43.5           45.8     +2.3
                                Average               47.3          49.0    +1.7        40.0           43.2     +3.2

Main Results on Typical Benchmarks

Benchmark  Agent      Model            Base  +EvoMem      Δ
GAIA       Memento-S  GPT-5.5          83.0     83.0   +0.0
                      Gemini-3.1-Pro   57.0     65.0   +8.0
                      Gemma4-31B       45.0     54.0   +9.0
                      Deepseek-V4-Pro  70.0     80.0  +10.0
                      GLM-5.1          70.0     77.0   +7.0
                      Qwen3.6-27B      70.0     75.0   +5.0
                      Average          65.8     72.3   +6.5
LoCoMo     A-Mem      GPT-5.5          32.9     33.9   +1.0
                      Gemini-3.1-Pro   21.1     28.6   +7.5
                      Gemma4-31B       52.3     55.2   +2.9
                      Deepseek-V4-Pro  52.0     56.5   +4.5
                      Kimi-K2.6        54.0     57.7   +3.7
                      Qwen3.6-27B      26.0     26.3   +0.3
                      Average          39.7     43.0   +3.3
EvoArena: EvoMem improves average step accuracy across all three evolving domains.
Chain consistency: EvoMem improves chain-level accuracy by +6.1% on Terminal-Bench-Evo, +2.1% on SWE-Chain-Evo, and +3.2% on PersonaMem-Evo.
General benchmarks: EvoMem also improves GAIA by +6.5% and LoCoMo by +3.3% on average.
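
The reported averages can be reproduced from the per-model rows. As a worked check, the snippet below recomputes the Terminal-Bench-Evo chain-accuracy averages from the eight values in the table above; it assumes the averages are unweighted means over models rounded to one decimal, which the page does not state explicitly.

```python
# Chain accuracy per model on Terminal-Bench-Evo, copied from the results table
# in row order (GPT-5.5 first, Gemma4-31B last).
chain_base = [31.8, 39.3, 14.9, 13.5, 34.2, 18.2, 11.1, 9.0]
chain_evomem = [45.5, 44.1, 22.7, 22.4, 36.8, 19.5, 17.3, 12.4]


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


avg_base = round(mean(chain_base), 1)
avg_evomem = round(mean(chain_evomem), 1)
delta = round(avg_evomem - avg_base, 1)

print(avg_base, avg_evomem, delta)
```

Under that assumption the recomputed values agree with the Average row (21.5, 27.6, and +6.1).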

Analysis

When and Why Does EvoMem Help?

The mechanism analyses indicate that EvoMem is useful when patch history becomes operational: agents retrieve relevant transitions, preserve historical constraints, and recover complete evolving evidence.

EvoMem helps when retrieved transitions are operationalized.

On Terminal-Bench-Evo, gains rise from +2.6% to +8.3% when patch uptake is nonzero, showing that historical transition evidence helps most when it changes the agent's plan or commands.

EvoMem reduces regressions across backbones.

On SWE-Chain-Evo, average PASS_TO_PASS failure rates drop from 9.09% to 6.32%, indicating better preservation of behavior introduced by earlier repository milestones.

EvoMem helps most when reasoning requires temporal or dispersed evidence.

On PersonaMem-Evo, temporal trajectory and multi-pattern synthesis questions gain +5.2%, matching the settings where a single consolidated memory state is most likely to lose evidence.

Patch histories improve complete evidence preservation.

EvoMem improves row-level preference evidence capture from 72.5% to 74.9%, with the largest capture gains on temporal trajectory and multi-pattern synthesis.

PersonaMem-Evo efficiency.

Higher token usage does not reliably translate into higher accuracy: Kimi K2.6 and Gemma4-31B-it are strong while using fewer tokens than the cross-model average.

Terminal-Bench-Evo efficiency.

GPT-5.5 has the highest terminal accuracy but uses far more tokens, while Gemini 3.1 Pro and GLM-5.1 remain strong with much lower token budgets.

Accuracy versus token usage across evaluated models

The efficiency analysis shows that token usage is not a reliable proxy for capability. On PersonaMem-Evo, Kimi K2.6 and Gemma4-31B-it reach strong accuracy with below-average token usage, while GPT-5.4-mini and GPT-5.5 consume more tokens without matching their accuracy. On Terminal-Bench-Evo, GPT-5.5 achieves the strongest accuracy but at a much higher token cost, whereas Gemini 3.1 Pro and GLM-5.1 remain competitive with substantially smaller token budgets. This suggests that evolving-agent benchmarks should report accuracy and inference efficiency together.

Citation

BibTeX

@article{xu2026evoarena,
  title     = {EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments},
  author    = {Jundong Xu and Qingchuan Li and Jiaying Wu and Yihuai Lan and Shuyue Stella Li and Huichi Zhou and Bowen Jiang and Lei Wang and Jun Wang and Anh Tuan Luu and Caiming Xiong and Hae Won Park and Bryan Hooi and Zhiyuan Hu},
  year      = {2026},
  journal   = {arXiv preprint arXiv:2606.13681},
}