Research explainer · May 2026

Continuous learning for agents is splitting into two worlds.

The first world updates tokens: memories, prompts, procedures, retrieved context, skills, traces, and distilled lessons. The second updates parameters: LoRA adapters, fast weights, test-time training state, online RL checkpoints, and model edits. The near-term winner is token space. The long-term prize is a safe bridge between the two.

+8.3 pts

ReasoningBank gain on WebArena vs memory-free ReAct.

91.6

Mem0-reported LoCoMo score in its 2026 memory report.

1.5–2 hr

Cursor’s online RL loop from rollout to fresh training data.

0.596 → 0.905

Sleep-style consolidation gain in one sliding-window setup.

Core thesis

Agents do not become continuous learners just because they remember more.

A long context window is a bigger desk, not a better mind. The hard problem is deciding what to preserve, what to rewrite, what to forget, when to retrieve it, and when the accumulated evidence is strong enough to change behavior.

Today’s useful systems mostly learn in token space. They externalize experience into text, structured records, graphs, skills, or procedures, then condition future model calls on that learned context. This is inspectable, portable, deletable, and compatible with closed frontier models.

Parameter-space learning is more powerful in principle, but harder to govern. Weight updates can compress experience into behavior, reduce inference-time context costs, and create genuine skill acquisition. They also introduce catastrophic forgetting, privacy leakage, opaque regressions, poisoning risk, and deployment complexity.

The map

Two axes, four useful regimes

Most “continuous learning” claims are really one of these four. Naming the regime prevents a lot of sloppy thinking.

Where does learning live?
Token-space, externalized
Parameter-space, internalized
Per-agent / personal

Personal memory agents

Files, vector stores, temporal graphs, user preferences, episodic summaries, self-edited instructions.

Examples: Letta/MemGPT, Mem0, Zep, Claude Code/OpenClaw memory.

Personal adapters

User- or org-specific LoRAs, model edits, private online tuning, fast weights for an individual agent.

Status: compelling but risky and rarely production-grade.
Population / product

Scaffold learning

Prompt evolution, workflow memory, eval-driven policies, tool routine updates, learned memory controllers.

Examples: ReasoningBank, EvoTest/J-TTL, AutoTTS, AgeMem.

Online model improvement

Continuously trained shared checkpoints from aggregate behavior and reward signals.

Example: Cursor Tab’s online RL loop.

Part I

Continuous learning in token space

The agent learns by changing the context it sees next time. The base model stays frozen, but the effective program changes.

1. Memory as state

Facts, preferences, episodes, prior failures, project state, and task history persist outside the model. Retrieval decides which pieces enter the next context window.

2. Reflection as compression

The agent turns raw trajectories into lessons. The important move is not storing the log. It is distilling reusable tactical knowledge from success and failure.

3. Context as policy

Prompts, skills, AGENTS.md files, tool descriptions, and operating procedures are all policy. Updating them changes behavior without touching weights.

The token-space stack

The clean architecture is a write → manage → read loop. The write path extracts candidate memories from interactions. The manage path deduplicates, summarizes, timestamps, resolves contradictions, scores usefulness, and sometimes deletes. The read path retrieves and ranks memories for the current task, then injects them into context in a form the model can use.

The field is moving from “store the conversation” to “store the lesson.” Google’s ReasoningBank is the crisp example: it extracts structured reasoning memories from both successful and failed trajectories, then retrieves them during future tasks. Google reports +8.3 points on WebArena, +4.6 on SWE-Bench Verified, and fewer execution steps versus memory-free ReAct. That is not weight-space learning. It is memory-mediated test-time adaptation, and it is probably the most practical near-term path.

ReAct

Reasoning traces become short-horizon working memory.

Reflexion

Verbal self-critiques after failed attempts become future guidance.

Voyager

Environment feedback becomes an executable skill library.

MemGPT / Letta

The context window becomes RAM. External memory becomes disk. The agent manages paging.

ReasoningBank

Raw trajectories become titled strategic lessons, including lessons from failure.

AgeMem

Memory operations become learned tool actions optimized with RL.

Additional systems to keep in the frame

Reflexion is the clean minimal example: after a task attempt, the agent writes a verbal reflection and conditions the next attempt on it. Voyager shows the stronger version in Minecraft: experience becomes executable skills, not just prose. Generative Agents shows the social-simulation version: observations become memories, important memories become reflections, reflections feed plans.

MemGPT / Letta is the operating-system version: context is RAM, archival storage is disk, and the agent learns to page memories in and out. Mem0 and Zep/Graphiti push the production version: extraction, consolidation, temporal validity, and graph-structured recall. Titans, LongMem, and SSM-style architectures move part of this from app-layer memory toward model-side memory modules.

Token-space mechanisms

What “learning” actually means in the current systems

MechanismWhat changesStrengthFailure mode
File memoryMarkdown instructions, logs, rules, project notes.Simple, inspectable, surprisingly competitive for one-user agents.Context limits, no semantic search, no temporal reasoning.
Vector memoryEmbedded snippets retrieved by similarity.Good fuzzy recall over large history.Similarity is not truth. Retrieves plausible but wrong memories.
Temporal graph memoryEntities, relations, valid time, transaction time.Better for changing facts and audit-heavy use cases.Extraction quality and schema drift become bottlenecks.
Reflection memorySummaries, lessons, anti-patterns, strategies.Turns experience into reusable policy.Self-reflection can hallucinate lessons or overfit one case.
Skill / prompt evolutionProcedures, tool routines, heuristics, config.Behavior changes are explicit and portable.Stale instructions silently steer agents wrong.
Learned memory policyThe agent learns when to store, retrieve, summarize, forget.Moves beyond hand-coded triggers.Hard to evaluate, easy to reward-hack.

Part II

Continuous learning in parameter space

This is the stronger form: experience changes the model or model-adjacent trainable state. It is also where the safety and eval bills come due.

Slow parameters

Base-model weights or adapters are updated through fine-tuning, RL, model editing, or continual training. The update persists across tasks and usually across users or deployments.

Fast parameters

Temporary weights or recurrent state update during inference. Examples include fast weights in SSM blocks, test-time training layers, and sleep-style consolidation before KV cache eviction.

Why it is hard

The standard continual-learning problem is catastrophic forgetting: the update that helps today’s task damages yesterday’s capability. LLM agents add three nastier problems. First, the learning signal is sparse and messy. Second, user data is private and often cannot be mixed into shared weights. Third, regressions are opaque. A bad memory entry can be inspected and deleted. A bad weight update needs an eval suite to even notice.

LoRA made parameter-space personalization more plausible by freezing the base model and training small low-rank adapters. The original LoRA paper reports up to 10,000× fewer trainable parameters and roughly 3× lower GPU memory than full GPT-3-scale fine-tuning. But LoRA is mostly an offline adaptation tool. The unsolved agent question is how to make adapter updates continuous, safe, reversible, private, and eval-gated.

Online RL

Cursor Tab is the strongest production signal: hundreds of millions of daily requests, frequent rollout of new checkpoints, 21% fewer suggestions, and 28% higher accept rate.

Test-time training

TTT adapts part of the model during inference. Large Chunk TTT pushes fast-weight updates over 2K–1M token chunks and scales state capacity dramatically.

Sleep / fast weights

Offline recurrence converts recent context into SSM fast weights before clearing KV cache, shifting reasoning compute to consolidation time.

Parameter-space mechanisms

The technical families to watch

FamilyLearning unitBest useMain unresolved problem
Continual fine-tuningBase weights or adaptersDomain adaptation from many verified examples.Forgetting, privacy, and expensive regression testing.
LoRA / QLoRASmall low-rank adaptersVersioned per-domain, per-user, or per-org learning.Adapter routing, merging, drift, and poisoning.
Model editingTargeted factual/behavioral editsTool drift, factual corrections, narrow durable updates.Edit locality and sequential edit stability.
Test-time trainingTemporary inference-time weightsSession-local adaptation from current context or verifier signals.Latency, hardware utilization, and safe objective design.
Fast weights / SSM stateRecurrent matrix/state updated as tokens streamLong-context compression and memory beyond KV cache.Turning compressed state into reliable reasoning support.
Sleep consolidationAdapters, fast weights, or memory modules updated offlineBackground learning after interaction, before next deployment.Choosing what consolidates and proving nothing broke.

Governance

The safety boundary moves with the learning substrate

Token-space risks

  • Memory poisoning through prompt injection.
  • Stale facts treated as current.
  • Retrieval of plausible but irrelevant memories.
  • Unreviewed durable writes from untrusted content.
  • Context rot from loading too much history.

Parameter-space risks

  • Catastrophic forgetting and hidden regressions.
  • Private data leakage through weights or adapters.
  • Irreversible or hard-to-debug behavioral drift.
  • Reward hacking in online RL loops.
  • Weak evals approving harmful updates.

Rule of thumb: token-space learning needs provenance and lifecycle control. Parameter-space learning needs eval gates, rollback, privacy boundaries, and canary deployments.

Research agenda

Cutting-edge experiments worth trying

These are designed to separate “better recall” from actual compounding improvement.

01

Memory ladder benchmark

Run the same agent over 30–100 related tasks. Compare no memory, append-only memory, reflection memory, temporal graph memory, and learned memory policy. Score success, cost, latency, repeated-error rate, stale-memory errors, and deletion compliance.

02

Failure-derived strategy bank

Build a ReasoningBank-style system for coding or browser agents. Force the bank to store negative lessons from failed trajectories, then test whether it prevents repeated mistakes under distribution shift.

03

Token-to-adapter distillation

Let an agent accumulate token-space memories for a domain, then periodically distill stable lessons into a small LoRA adapter. Evaluate whether the adapter reduces retrieval cost without losing auditability. Keep the original memory as the source of truth.

04

Sleep-time consolidation for agents

After a long task, run an offline “sleep” job that writes three artifacts: episodic summary, semantic facts, and procedural rules. Compare blocking extraction, background extraction, and multi-pass consolidation.

05

Personal adapter with rollback

Train a per-user adapter from approved memories only. Use strict holdout evals, red-team prompts, and rollback snapshots. Measure personalization gains versus memory-only prompting.

06

Memory poisoning harness

Seed malicious webpages, images, emails, or tool outputs that try to write durable memories. Test provenance filters, confirmation thresholds, conflict checks, and audit logs.

07

Agentic memory controller

Expose store, retrieve, summarize, forget, and pin as explicit actions. Train or prompt a controller to decide when to use them. Compare hand-coded triggers against learned policies.

08

Fast-weight simulation

Use a small open model with TTT/LoRA-style temporary updates on long synthetic tasks where the raw context is later removed. Separate memory load from reasoning depth, as the sleep paper does.

Recommended architecture

A practical continuous-learning stack for 2026

1

Hot context

Small always-loaded constitution: identity, rules, current projects, tool boundaries, known failure modes.

2

Memory store

Hybrid storage: files for inspectable memory, vector search for fuzzy recall, temporal graph for entities and changing facts.

3

Reflection layer

Background jobs that turn traces into facts, episodes, and procedural lessons. Keep provenance.

4

Eval loop

Task suites, replay logs, regression tests, memory-poisoning tests, cost and latency dashboards.

5

Optional adapter layer

Only after token-space learning proves stable. Distill approved, repeated lessons into reversible adapters.

Source trail

Primary sources and useful anchors

This site combines live source review with Boswell’s local knowledge base on agent memory, context engineering, and agent infrastructure. Vendor benchmark claims are treated as directional unless independently reproduced.