Agent engineering is an emerging discipline. Two significant patterns have crystallized over the past year: the agent harness and agent skills.
Agent skills are sometimes treated as an organizational convenience. We see them differently at ReadyLoop.ai; skill-based agents are a fundamental architectural shift, separating composable and reusable business logic from the agent harness. They are the .exe, .jar, or .ipa of the agent era. In this post I’ll lay out where agent architecture stands in mid-2026, and what is needed from an enterprise platform for skill-based agents. I’ll also explore some of the gaps that exist in today’s agent platforms; these gaps are what motivated us to build ReadyLoop.
Agent loops, harnesses, and platforms
Highly capable agents are part of our lives in 2026. They are useful across a range of daily knowledge work tasks (e.g., coding, research, data analysis, report generation). We have general-purpose agents like Codex, Cowork, Manus, or Perplexity; many domain-specific and specialized agents also exist, for example, Google Cloud Assist or Sierra.ai’s customer service agents.
The term agent has been used in AI going back decades. It also has some questionable uses in contemporary discourse. We’ll take Simon Willison’s succinct definition, capturing the essence of what agents are today:
An LLM agent runs tools in a loop to achieve a goal.
This is pretty much the ReAct pattern, but is also arguably what’s going on in a coding agent like Claude Code. There may be parallel sub-agents, persistent memory, MCP, A2A, RAG, and plugins, but these can all be viewed as tool use. Distinct phases such as planning can be structured through tools or by embedding the agent in a workflow graph. An agent execution history (also called trajectory) is then simply a trace of interleaved LLM and tool calls.
An agent harness is the agentic runtime for the agent loop, providing:

- Tools and an execution environment. The execution environment is where the agent runs and what its tools interact with. It includes things like a sandbox, file system, web browser, shell, and Python interpreter. Harness tools can be built-in (e.g., web search) or part of the execution environment (e.g., `grep`). Tools can read from and write to files to implement memory across sessions. They can invoke other agents, for example an agent-as-tool sub-agent or MCP tool call.

- Context management. Prompt context is supplied to each LLM call. Getting the most out of a given model requires managing this limited context window efficiently: the model should be supplied with only the relevant information and no more. Context management is particularly important in long-horizon agents.

  Naively, at the end of each iteration of the agent loop, LLM output (including tool arguments) and tool results are appended to the context. Over multiple iterations, the context window becomes polluted with stale or misleading information. As the prompt context grows, model inference efficacy drops, and at some point you hit the absolute or effective limits of the model context window.

  Context engineering is the discipline of managing prompt context as a resource. Techniques include compaction, persisting plans and notes outside the context, tool history pruning, swapping older context to the file system, progressive disclosure of prompts and tools, KV-cache optimization, sub-agents, and so on. Anthropic and Manus have published helpful articles on context engineering (see also this video interview).

  Implementing context engineering well requires a lot of debugging, feedback from evals, and production burn time. This is what differentiates a polished agent harness from the harness you can hack up in an afternoon.

- Session management. Some agent interactions are one-shot, a single turn; others are multi-turn. A user will request a report from a research agent and then, after the agent loop completes and returns the report, ask a follow-up question. Sessions have state and history; they can be checkpointed and resumed, forked, and even teleported (live migrated). With some harnesses, sessions can be launched non-interactively, via a webhook or a `cron`-like scheduler.

- Observability. This includes logs, traces, and metrics related to the agent loop, i.e., model calls and tool invocations. The OpenTelemetry-based OpenInference conventions are an example of the information tracked in an agent trace.

- Guardrails. Agent harness guardrails extend the model-serving guardrails to the agent problem domain. You probably want to catch an `rm -rf` of some directory with a deterministic guardrail and put a human in the loop on that tool call. A basic guardrail that any agent harness will implement is limiting the number of iterations in the agent loop, bounding runaway loops and the resulting latency and cost.

- API / UI. Somebody or something, somewhere needs to interact with the agent. Harnesses are often integrated into an application or platform via SDKs or API calls. Some platform-light harnesses directly provide a terminal shell (e.g., Gemini CLI) or IDE (e.g., Cursor).
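To make one of the context-engineering techniques above concrete, tool-history pruning can be sketched in a few lines. This is a toy sketch assuming a simple message-list representation; `prune_tool_history` and its `keep_last` parameter are illustrative names, not any particular harness's API.

```python
def prune_tool_history(messages, keep_last=2):
    """Replace all but the last `keep_last` tool results with a stub,
    keeping the conversation shape while reclaiming context budget."""
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_indices[:-keep_last]) if keep_last else set(tool_indices)
    pruned = []
    for i, m in enumerate(messages):
        if i in stale:
            # Keep a placeholder so the model can re-run the tool if needed.
            pruned.append({"role": "tool",
                           "content": "[result pruned; re-run tool if needed]"})
        else:
            pruned.append(m)
    return pruned
```

A real harness would combine this with compaction and file-system swap, but the principle is the same: old tool output is usually the cheapest context to reclaim.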
The pi coding agent is an example of a minimal agent harness and has a well-written post describing its design history and architecture.
Agent harnesses are a big deal right now. Harness differences explain dramatic differences in frontier model performance on benchmarks and leaderboards. Choices around tooling, execution environments, and context engineering matter.
The agent harness is part of the agent platform. The platform provides what’s missing to make the agent harness useful in production. Some platform aspects reflect traditional production engineering concerns: identity management, deployments, billing, autoscaling, monitoring, and governance. Other aspects are agent-specific, such as support for evals. Agent evals are what separate agent building from an exercise in mysticism. You will be glad you have them when you change a single word in the system prompt and you catch a major regression before it impacts users in production. Another example of agent-specific platform concerns is the gateway, which needs to support agent-specific protocols such as MCP and A2A.
Putting all this together, many pieces go into making an agent platform.
Agent harnesses are sometimes confused with agent frameworks (e.g., LangChain, Crew.ai, Autogen, Pydantic AI), which are also used to build agents. Agent frameworks are just libraries and abstractions for building an agent loop and harness from scratch. You don’t need an agent framework to build an agent; you can write a while loop calling an LLM API or you can use an agent harness. Given how good off-the-shelf agent harnesses have become, if I found myself reaching for an agent framework, I’d ask whether I should be using an existing general-purpose agent harness instead.
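That "while loop calling an LLM API" really is small. Here is a minimal sketch, with `call_llm` and the tool registry as stand-ins rather than a real provider SDK; the iteration cap doubles as the basic runaway-loop guardrail discussed earlier.

```python
MAX_ITERATIONS = 10  # bound runaway loops, latency, and cost

def run_agent(goal, call_llm, tools):
    """Run tools in a loop until the LLM stops requesting them.

    call_llm(messages) is assumed to return a dict with "role",
    "content", and (optionally) "tool" / "args" keys.
    """
    messages = [{"role": "user", "content": goal}]
    for _ in range(MAX_ITERATIONS):
        reply = call_llm(messages)
        messages.append(reply)
        if reply.get("tool") is None:      # no tool call -> final answer
            return reply["content"]
        result = tools[reply["tool"]](**reply.get("args", {}))
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent exceeded iteration limit")
```

Everything a polished harness adds (context management, sessions, guardrails, observability) wraps around this core.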
agent_skills.exe
Agent skills are another recent shift in agent architecture. They provide agent harness extensibility, similar to other extensibility mechanisms like MCP. A skill-based agent architecture provides a needed separation of the agent business logic (e.g., how to research and generate a podcast with today’s market analysis) from the mechanics of the agent harness (e.g., compaction or session persistence). Earlier agents built on frameworks or model provider SDKs mixed these concerns together, like how font size was mixed with HTML in the early web before the arrival of CSS.
Agent skills are a reusable and composable bundle of LLM context and deterministic code; they are a first-class unit of agentic capability. Anthropic originally developed agent skills; an open format for them is now specified at agentskills.io.
An agent skill consists of:

- A SKILL.md file that provides both the metadata (front matter) that describes the skill and the prompt context that is added when the skill is loaded. This is analogous to a system prompt for the skill.

- Scripts that are invoked by the agent as tools. A script can fetch the weather, write to a database, or even invoke some other agent-as-tool. Scripts are an effective way to supply domain-specific tool extensibility: a minimal general-purpose agent harness only needs a few tools to read/write files and execute scripts; the rest of the tool code can then be contained in the skill scripts. Scripts are not just simple utilities; they can have complex dependencies (e.g., via PEP 723) and substantial implementation complexity. Our D&D Dungeon Master skill from the last post had over 20k lines of code (!).

- References and assets: additional Markdown, CSV tables, or images. These resources are subject to progressive disclosure, or read on demand from scripts.
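For illustration, a minimal skill script with a PEP 723 inline-metadata block might look like this. The script, its functions, and the order schema are hypothetical; a real script would list third-party packages (an HTTP client, say) under `dependencies`.

```python
# /// script
# requires-python = ">=3.11"
# dependencies = []
# ///
# A PEP 723-aware runner (e.g. `uv run`) reads the comment block above
# to provision an isolated environment for the script.

import json


def summarize(order: dict) -> str:
    """Render a one-line, customer-facing summary of an order record."""
    items = len(order.get("line_items", []))
    return f"Order {order['id']}: {order['status']}, {items} item(s)"


def main(raw_json: str) -> str:
    """Entry point the agent would invoke with the raw API response."""
    return summarize(json.loads(raw_json))
```

The inline metadata is what lets a harness execute skill scripts with their dependencies without any per-skill environment setup.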
Here’s an example SKILL.md fragment:
```markdown
---
name: acme-orders
description: >-
  Look up an Acme order's current status, line items, and shipping
  ETA against the internal Orders API. Use when a CSR asks for an
  order summary by ID ("status of ACME-12345", "where is order 98765").
---

# Acme order lookup

Fetch a single Acme order by ID and produce a short summary: status,
line items, total, and shipping ETA. One order ID per invocation; if
the user names several, loop the call.

## How to fetch

Run `scripts/lookup.sh <ORDER_ID>` to retrieve the raw JSON record
from the Acme Orders API. The script handles auth via
`$ACME_API_TOKEN` and retries on 5xx. Pipe the result through
`scripts/format.py` to render the customer-facing summary.

## Decoding the result

Status codes returned by the API are not human-readable. Map them
via `references/status-codes.md` before reporting back. Carrier
codes in the `shipping.carrier` field are listed in
`references/carriers.csv`. For escalation rules by region, see
`references/escalation-policy.md`.

...
```

Skills are organized around the principle of progressive disclosure: the agent harness initially loads only the SKILL.md front matter. The agent then loads the rest of the SKILL.md, scripts, or additional resources as the LLM determines they are contextually relevant.
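The harness side of progressive disclosure can be sketched as a loader that parses only the front matter at startup, deferring the body and resources. This is a deliberately naive sketch (it ignores YAML edge cases such as multi-line values), and `load_front_matter` is an illustrative name, not part of the agentskills.io spec.

```python
def load_front_matter(skill_md: str) -> dict:
    """Extract simple key: value pairs from the front matter block only,
    leaving the SKILL.md body unread until the skill is activated."""
    lines = skill_md.splitlines()
    assert lines[0].strip() == "---", "SKILL.md must start with front matter"
    meta, i = {}, 1
    while lines[i].strip() != "---":
        if ":" in lines[i] and not lines[i].startswith(" "):
            key, _, value = lines[i].partition(":")
            meta[key.strip()] = value.strip()
        i += 1
    return meta
```

At startup only the name and description enter the context; the model uses them to decide whether the full skill is worth loading.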
Some agent skills will be highly portable across different agent harnesses and platforms. For example, a Markdown-only style guide review skill should work across many different agent environments. The agent skills specification allows for various harness-specific choices, such as which languages to support for scripts, or how a runtime will treat persistence of skill data on the file system. While agent skills are aspirationally portable, in reality this is not write once, run everywhere.
Agent skills have exploded in popularity recently. Software engineers have been building skills since support appeared in Claude Code, capturing common workflows that benefit from reuse across teams and projects, e.g., code review or frontend testing. LegalZoom’s stock dropped 20% when Anthropic released their legal skill plugin for Cowork. ClawHub has many thousands of skills. Popular skills teach the agent how to work with Slack, the xAI API, Trello, or Polymarket. A significant number of the skills on ClawHub are malware, reflecting both the immaturity of the skill ecosystem and the weak security model of OpenClaw.
Agent skills make agent building approachable to a broad range of knowledge workers. You don’t need to be a coder to generate a reusable SKILL.md capturing your task triage workflow. You can get useful skill scripts from a general-purpose agent like Cowork; LLMs excel at writing code. There’s a continuum of skill complexity, from the equivalent of a short shell script to a state-of-the-art research agent. At the far end of the spectrum, building production-ready agent skills requires investment in architecture, versioning, testing, and extensive evals.
Enterprise-grade, production-ready, skill-based agents
Agent platforms, whether consumer-oriented or enterprise-focused, have some table-stakes features. We enumerated them earlier; the platform should support identity, rollouts, billing, evals, monitoring, and so on.
There are some additional agent platform properties that speak specifically to enterprise concerns:

- Model agnosticism. Frontier model providers are behind some of the leading agent harnesses and platforms. Many enterprises embrace this combination: leading frontier model capability paired with an opinionated, vertically integrated platform. It’s also common for enterprises to prefer not to be locked into a single model provider, or to want to host their own models on-premise. Reasons include data sovereignty, IP protection, model fine-tuning, and multisourcing for cost and reliability. Critical line-of-business agents should not fail at the next Claude outage.

- On-premise hosting. SaaS is the starting point for almost all agent platforms, with OpenClaw a notable exception. Enterprises often also seek an on-premise option for internal or production services. In 2026, “on-premise” can mean hosting in an enterprise’s own data center or in projects they own on a public cloud like AWS or GCP. Many of the reasons for on-premise hosting overlap with those for model self-hosting.

- Multi-cloud. Many enterprises have a multi-cloud strategy; an enterprise agent platform needs to be available across public clouds.

- Compliance. Enterprises have regulatory concerns and compliance requirements from their customers. Enterprise agent platforms need to meet standards and be audited for SOC 2, FedRAMP, HIPAA, PCI-DSS, etc.

- Enterprise integrations. Any enterprise will have a mix of modern and legacy systems that agents might treat as the system of record, use as identity providers, request services from, or post updates to. An agent platform needs a batteries-included approach to the enterprise stack: Google Workspace, Microsoft Teams, Slack, Jira, Notion, Salesforce, and ServiceNow, among others. Ease of adaptation to legacy or bespoke enterprise systems is a hard requirement.

- Persona separation. Agents have many stakeholders in an organization: agent builders, agent consumers, operations, security, and governance. An enterprise agent platform needs to treat the requirements and UX of each persona as a first-class concern.

- Multi-tenancy at scale. Enterprises have many employees, teams, products, partners, and external customers building and consuming agents. Per-agent overheads need to be in the right order of magnitude to support this dynamic. An anti-pattern is OpenClaw, which, alongside its serious security concerns, carries an always-on tax of 2 GiB of memory per instance that sits idle most of the time.
Skill-based agents also have their own platform concerns. Skills should be treated as first-class entities in the platform data model. Skills need to be built, deployed, versioned, tested, evaluated, secured, shared, and governed across their lifecycle. Enterprise employees should not be e-mailing .zip files as the primary form of skill sharing or downloading skills from ClawHub.
An enterprise-ready platform for skill-based agents will also require:

- A built-in agent harness with support for agentskills.io. Until recently, AWS Bedrock AgentCore only supported agents-as-code; you could bring your own harness, which had the advantage of flexibility but implied serious friction for the skill-based agent model. Bedrock’s recent introduction of the AgentCore harness (with support for agentskills.io) validates the trend toward platforms with built-in skill-based agent harnesses. Not all agent platforms have this; for example, the recently announced Gemini Enterprise Agent Platform still follows the agents-as-code model at the time of writing, with ADK wrapping required for skill distribution.

- Skill catalogs. Provenance-aware, company-specific marketplaces or catalogs that enable distribution, sharing, and vetting of skills in the enterprise. Marketplaces like these exist for agents in agent platforms, but they don’t always treat the skills themselves as first-class governed entities.

- Skill-scoped security policy. Sandboxing and assigning identity to an agent coarsely is one thing; doing it well is another. Excess agency is a well-known security consideration in the agent space. Ideally, fine-grained policy controls can be applied to every file system and network interaction by the agent. Identity should reflect not only the agent resource but also the invoking user on whose behalf the agent runs, as well as the unique session.

  For skill-based agents, we care about ensuring that less-trusted code in skill scripts is governed by the principle of least privilege via sandboxing and is subject to skill-level policy controls. This is tool security policy with skill awareness. If an agent uses a script to read from a customer database and posts summaries to Trello, it’s an unnecessary risk for the database authentication token to be accessible in the sandbox when running the Trello script. Skills should declare their network and API dependencies in manifests so that high-trust and low-trust skills can’t be mixed together accidentally. This is part of how we can structurally tackle the skill trust problem.

- Skill evals. Skills are intended to be composable and reusable. They need to be evaluated on their own, alongside agent-level evals.
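To make the manifest-declared dependencies mentioned above concrete, a skill-level declaration might look like the following. This format is hypothetical, not part of the agentskills.io specification; all field names are illustrative, and the values echo the Acme order-lookup example.

```yaml
# Hypothetical skill manifest fragment: every field shown is illustrative.
name: acme-orders
network:
  allow:
    - orders.internal.acme.example   # Orders API only
secrets:
  - ACME_API_TOKEN                   # injected only for scripts/lookup.sh
scripts:
  trello-post:
    network:
      allow: [api.trello.com]
    secrets: []                      # no database token in this sandbox
```

With declarations like these, the platform can enforce least privilege per script rather than granting the whole agent the union of every skill's access.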
Below we compare existing agent platforms along some of these dimensions. The focus is on the major enterprise- and skill-relevant criteria where platforms differ. The gaps are where we at ReadyLoop.ai see an opportunity to build something better.
ReadyLoop.ai
ReadyLoop is a managed agent platform for skill-based agents. We built ReadyLoop to close the gap between where agent platforms are today and where they need to be for enterprises to adopt skills as the foundation of their agents.
In the next post in this series I’ll walk through the ReadyLoop architecture, which we designed to meet the requirements of enterprise skill-based agents. Subscribe below for future updates and feel free to check out agent.readyloop.ai to see the platform in action.