2. Anatomy of an Agent

Authors: Priya & Stephen

Jun 21, 2026

Read Stephen's Preface to Agents Unpacked if you're new here.

You have used a large language model. You know the deal: a careful prompt gets a careful answer. A vague prompt gets a vague one. And the model itself does not keep anything from one conversation to the next, unless something external is holding that context for it.

Agents work differently. They have parts that do things a plain LLM does not. These parts are what make an agent an agent. It is not just the model underneath. It is the structure built around it that gives the system its abilities to persist, act, and keep going.

Understanding this structure is the second major shift in this series. The first shift is seeing that a chatbot can give you a good answer without finishing the job, because it stops after responding. The second shift is seeing that an agent is not a smarter model. It is a model placed inside a structure that gives it something to act with and somewhere to keep what it has done.

The Agent Formula

Most agents share the same basic parts:

A model (the LLM): the reasoning engine that understands language and decides what to do
Instructions: what tells the agent who it is, what it is for, and what ‘good’ looks like
Memory: a workspace or store that holds what has happened so far
Tools: capabilities the agent can call on to do things beyond generating text
An execution loop: the cycle of observing, deciding, acting, and checking

Different platforms package these differently. Some call memory “context,” some call tools “plugins” or “capabilities,” and some merge instructions and tools into a single configuration layer. But the parts are the same. An agent is not a single thing. It is a system, and each part matters.

Stephen: Don’t LLMs also have memory since they remember what happened earlier in the conversation? How’s this different?

Here is one distinction worth getting clear early: the context window and memory are not the same thing. The context window is the working space an LLM uses during a single session. It holds the conversation so far and gets loaded fresh every time the model gets a chance to speak. Memory, by contrast, is information stored outside the model, maintained by the system, and available across sessions and steps. We will come back to this.

An agent needs all its components:

Agent = Model + Instructions + Memory + Tools + Execution Loop

Leave any one of these out and the system changes behaviour in ways that matter. We will look at each piece in turn.

What the Model Does and What It Doesn’t Do

The model is the reasoning core. It reads your request, figures out what to do, and decides what to say back. It gets the most attention because it is the part that generates language.

But a model on its own is like a brilliant mind with no hands and no memory of its own. It can think. It cannot act. It cannot remember what happened five minutes ago unless something explicit carries that information forward.

Stephen: Wait a second. You say the model doesn’t remember what happened five minutes earlier. But when I use an LLM, it does seem to remember what happened earlier in the conversation.

Here is what is actually happening. When an LLM appears to remember earlier in a conversation, it is not the model itself that is remembering. The context window is carrying all the earlier messages along with your new message, every time you send something. The model sees the full conversation again and generates a response that fits what came before. That is not memory in the model. That is the system feeding the model a transcript.

This trips up almost everyone when they start using agents. The model generates text. The rest of the system decides what to do with that text and whether to act on it.

A better model helps. It reasons more clearly, follows instructions more faithfully, and handles edge cases better. But dropping a smarter model into an agent that is missing a working execution loop will not make it an agent. You need the other parts too.

Instructions: The Agent’s Direction

Instructions tell the agent what it is supposed to do and how to behave. Some systems call these system prompts. Others call them agent definitions or behavioural instructions. The name does not matter. What matters is that they are the layer that tells the model why it exists, who it is helping, and what ‘good’ looks like for the task at hand.

Good instructions do not make an agent smarter. They make it more focused. They give it a frame for every decision: what to prioritise, what to avoid, when to ask for help, how to present its output.

Stephen: Are these what are often called ‘skills’, or are skills something else altogether?

Skills and instructions are related, but they are not the same thing. Instructions are the core behavioural direction: who the agent is, what it is for, how it should approach its work. A skill, in platforms like OpenClaw and Hermes, is a specific file that tells the agent how to carry out a particular task, often by combining one or more tools. So instructions tell the agent how to behave generally. A skill tells it how to do something specific. We will see this distinction more clearly when we look at how different platforms implement these parts.

The instructions shape what the agent notices, what it proposes, what it tries, and what it says no to. Two agents built on the same model with different instructions will behave differently in the same situation. They will notice different things, prioritise differently, and produce different outcomes.

Poorly written instructions can quietly break an agent. If the instructions are vague, the agent has to improvise every step. If they contradict each other, the agent has to choose, and it might not choose the way you intended.

Stephen: Can you provide a few examples of what these instructions may look like in different scenarios?

Here is what instructions might look like in practice. A poorly-written instruction can quietly break an agent. Consider an instruction that says “be helpful and concise” without defining either term. When a user asks for a full technical breakdown, the agent has to arbitrate between two vague goals. It might give a two-sentence answer that technically satisfies “concise” but ignores “helpful,” or it might give an exhaustive response that satisfies “helpful” but ignores “concise.” Either way, the agent is improvising because the instructions gave it no real frame for the conflict.

A research assistant agent might have instructions that say something like: “You are a research associate working for [user name]. Your role is to find, summarise, and organise information on topics the user assigns. Always cite your sources. Flag uncertainty rather than guessing. Present findings in a clear brief, not a wall of text.”

A code review agent might have very different instructions: “You are a principled code reviewer. Focus on correctness, clarity, and performance. Do not praise code unnecessarily. When you find an issue, explain why it matters and suggest a concrete fix. Keep responses short.”

The difference between those two sets explains a lot about why two agents can feel like entirely different systems, even if they use the same model underneath.

Memory: The Workspace and Context

Memory in an agent is not like human memory. It is a structured store of information kept and updated as the agent works. It is what lets the agent hold a thread across multiple steps without starting from scratch each time.

Most agents use some combination of three types:

Working context — what is active right now: the current goal, what has been tried so far, what the user last said
Stored information — what the agent has been told about the user, their preferences, their past requests
Files and state — what exists in the workspace right now, what has been written or read recently

This is not a personality feature. It is not the agent “remembering” in the way a person remembers their childhood. It is operational continuity. The system maintaining a thread of relevant information across time and steps.

Different platforms handle these differently. LangChain agents build up a rolling context window: the current request gets appended to everything that happened before, and the whole thing is passed to the model. If the conversation gets long, older turns get dropped or summarised to make room. AutoGen agents can maintain shared memory across a team, so that when one agent finishes a task, what it learned is available to the next agent that picks up the thread.

OpenClaw takes yet another approach. Its memory layer is a structured store that agents write to and read from across sessions. When an agent starts a new session, it can query that memory store for relevant context rather than relying solely on what was in the most recent conversation. An agent can know that the user prefers short emails, even if that was established three weeks ago.

Stephen: If memory can be stored in files, does it mean that agents can have nearly unlimited memory (within the limits of the computer or server’s overall memory capacity)?

There are practical limits even when storage is effectively unbounded. The more relevant limit is not how much the agent can store, but how well it can find and use what it has stored. A full inbox is not the same as a well-organised one. Retrieval becomes harder as memory grows, and irrelevant information can dilute the signal if the system does not manage it carefully.

Think of it this way. A context window that holds 128,000 tokens can technically hold a lot of information. But it can only hold what was placed there. An agent with a large memory store full of useful context still needs a way to surface the right information at the right time. If it cannot find what it needs, or if what it finds is buried under noise, the effective memory is constrained.

The quality of retrieval matters as much as the quality of storage. An agent that retrieves relevant context poorly is effectively working with a much smaller memory than one that retrieves well, even if both store the same amount.

Stephen: So, tell me if I understood this. The agent has an index telling it where to find information specific to certain topics or tasks. When the LLM part of the agent decides it needs to deal with a certain topic, it uses the index to read and load the information from the memory file into its context. Is that right?

That is broadly right. The memory store, the index, and the retrieval into context are the key parts. One small correction worth noting: the decision to retrieve from memory is typically made by the agent or coordinator layer, not by the LLM directly. The LLM receives the retrieved content as part of its context, but it is the agent system that decides what to look up and when. This distinction matters because it is the agent layer, not the model, that is doing the memory management.

Stephen: But isn’t the agent’s brain the LLM? Clarify the distinction in your answer above. Which part of the agent’s infrastructure deals with this?

It is a fair challenge. The LLM is genuinely where the reasoning happens. It reads context, generates text, and makes decisions about what to say or do next. But it is also just a text processor. It receives input, produces output, and has no awareness of anything beyond the tokens it has been given.

The coordinator layer is the infrastructure that sits around the LLM and manages the process. It reads the LLM’s output, decides whether to act on it, calls tools, retrieves memory, and feeds results back into the next LLM call. It is the difference between the LLM thinking and the agent doing. A bare LLM generates text. The coordinator turns that text into action.

To use a rough analogy: the LLM is like a pilot who can read instruments and make decisions. The coordinator is like air traffic control — it decides which runway to use, when to land, and when to divert. The pilot’s brain does the reasoning. But without the infrastructure around it, the pilot just sits in the cockpit thinking.

So when we say the agent retrieves memory, we mean the coordinator retrieves it and places it where the LLM can see it. The LLM does not reach into a file and pull something out. The coordinator does that work and presents the result to the LLM as part of the next context.

Stephen: And are the bits of these files then loaded into the LLM’s context? Therefore, the more stuff is loaded from the memory files, the more the context fills up, affecting the rest of the conversation and cost, right?

Yes, exactly right. Memory retrieval feeds into the context window, which is the LLM’s working space for the current session. Every token that goes into the context window is a token the LLM processes and a token that costs something. Loading a lot of context from memory means less room for the conversation itself, and it means higher token usage on every call.

This is one of the practical engineering tensions in agent design. Loading more memory gives the agent more to work with, but it also makes each LLM call more expensive and slower. A well-designed agent retrieves only what is relevant to the current task, not everything it knows.

Tools: What the Agent Can Actually Do

Tools are the capabilities that let an agent act beyond generating text. The model decides to use a tool. The tool performs an action and returns the result to the model.

This was covered in Chapter 1 under “Tools Are the Hands.” Here it is worth noting that tools are also where agents differ most between platforms. Some agents come with a large built-in toolkit. Others can call external tools through open protocols. Some let you build custom tools. Others are more locked down.

What tools might an agent actually have? A research agent might be able to search the web and read files on your machine. A coding agent might run shell commands and read or write files. A calendar agent might check your schedule and send messages. The tool is the bridge between the model’s decisions and the world the agent is working in.

What matters is not how many tools an agent has, but whether the tools it has are the right ones for the tasks you want it to perform.

Different platforms implement tools differently. LangChain provides a standardised tool interface that lets you connect to search APIs, databases, file systems, and custom functions. OpenCode agents run inside a development environment, where the tools available are the commands and interfaces of that environment. OpenClaw uses an open tool protocol that lets agents call external capabilities regardless of who built them. Hermes takes a more composed approach: a skill file specifies not just what the agent should do, but which tools to use and in what combination to carry out a specific task.

Here is the thing worth unpacking. A tool on its own is just a capability. What makes it useful is the bridge between what the agent is trying to accomplish and the tool that can help. A calendar tool is useless if the agent does not know it should check the schedule. An agent running a meeting-preparation skill that says “check availability, send invites, prepare a briefing document” has that bridge built in.

The Execution Loop: The Part That Makes It an Agent

The execution loop is the cycle that takes an agent from a single-shot response to a sustained process. Observe, think, act, check, repeat.

This was the core of Chapter 1. But it is worth restating here, in the context of anatomy, because the loop is what ties all the other parts together. Without it, you have a model that receives instructions and context and produces text. With it, you have a system that can pursue a goal across time, recover from partial failures, and stop when the work is genuinely done.

The loop is the difference between an agent and a very well-instructed chatbot.

Here is why the repeat step matters so much. A model has no native sense of when it is done. When you call a function in code, the function returns and you are finished. When a model generates text, it produces tokens until it hits a stop condition built into the model itself, most commonly a token limit or a designated stop sequence. These conditions tell the model when to stop generating, but they do not tell the agent whether the result is actually what the user wanted. There is no built-in check that says “is this the right answer?”

The execution loop provides that check. The check phase asks: is the result good? Does it meet the original goal? If not, the loop continues. Sometimes that means a dozen or more cycles before a task is genuinely complete.

The loop also determines how goals decompose. In LangChain’s ReAct-style agents, the loop runs inside a single agent: observe, decide on the next action, execute it, check the result, repeat. In AutoGen, the loop is distributed across multiple agents that hand off to each other. A planner agent might coordinate specialist agents, each running their own loop on their own piece of the problem. OpenClaw uses a coordinator agent to manage the loop, assigning work to sub-agents and handling the check phase across the full task rather than within a single agent cycle.

The architecture of the loop is one of the most significant differences between agent platforms. But the function is the same everywhere: turning a sequence of isolated model calls into a coherent, goal-directed process.

Multiple Platforms: Comparing the Formula in Practice

It helps to see the same five-part formula playing out in different platforms. Here is how a few of them map onto it.

LangChain is one of the most widely-used agent frameworks. A LangChain agent has an LLM at its core, a set of tools, a prompt defining the agent’s role, memory that accumulates conversation history, and an agent executor that runs the loop. The loop in LangChain is explicit: the agent executor repeatedly calls the model, parses the model’s tool-call output, runs the tool, and feeds the result back until the model says it is done.

AutoGen takes a different approach. Rather than a single agent, AutoGen sets up a team of agents that communicate with each other. Each agent has a model, instructions defining its role, and its own set of tools. The loop is distributed: there is no single execution cycle. Agents exchange messages, delegate tasks to each other, and the overall process continues until the team has finished the assigned goal. Memory in AutoGen can be shared across agents so that one agent’s work is available to the next.

OpenClaw uses a coordinator agent that manages the overall execution loop. Sub-agents each have their own identity, tools, and memory. The coordinator decides which sub-agent handles which part of a task, passes context between them, and handles the check phase across the full goal. Skills in OpenClaw are files that tell a specific agent how to carry out a particular task, combining instructions about what to do with definitions of which tools to use.

Hermes also uses a skill-based architecture where skill files define both the instructions and the tool configuration for specific tasks. Rather than a single general-purpose agent, Hermes composes agents from skills that know how to use particular tools in particular contexts.

OpenCode works differently again. It runs agents inside a development environment, typically a cloud workspace. The tools available to the agent are the commands and interfaces of that environment. The loop is typically managed at the task level: the agent receives a task, works through it using the tools at its disposal, and reports back. There is less of a formalised multi-step loop and more of a task-completion focus.

None of these platforms invents new parts of the agent formula. They all use a model, instructions, memory, tools, and an execution loop. What differs is how those parts are implemented, how they are divided up, and how they communicate. Understanding the formula means you can look at any of these platforms and see what you are actually looking at.

What This Chapter Covered

This chapter pulled apart the five components of the agent formula.

We saw how the model is the reasoning core but cannot act or remember on its own. How instructions shape the agent’s focus and behaviour, and why the same model with different instructions can feel like a different system entirely. How memory provides operational continuity across steps and sessions, and why retrieval quality matters more than storage capacity. How tools extend what the agent can do beyond generating text, and why a tool is only as useful as the bridge between the model’s decisions and the action the tool can take. And how the execution loop is the architecture that turns isolated model calls into a coherent, goal-directed process.

We also saw how different platforms implement the same five components differently: LangChain’s explicit agent executor, AutoGen’s team-based coordination, OpenClaw’s coordinator and skill-based sub-agents, Hermes’s composed skill architecture, and OpenCode’s environment-integrated approach.

The goal was not to become an expert on any one platform. It was to show that agents are not mysterious black boxes. They are systems built from a small number of recognisable parts, and once you know what to look for, you can see the anatomy underneath any agent platform you encounter.

Next up in Agents Unpacked: we dig into tools and skills: what it actually means for an agent to do something rather than just say it, and why a well-tooled agent operating autonomously in a loop is a fundamentally different thing from a model answering questions.

<< Previous Post: From Answer to Outcome

>> Next Post: Coming Soon

Table of Contents

stephengruppetta.com

Discussion about this post

Ready for more?