AI Skills Are the Worst of Both Worlds

We wanted semantic search.

We got artisanal context plumbing.

That gap is where most of the pain lives.

The promise sounded clean: describe what you want, let the system understand the intent, and retrieve the right capability at runtime. No giant prompt stuffed with every tool. No brittle keyword matching. No human manually deciding which pieces of the world an agent is allowed to remember before the task even starts.

Just intent, matched to action, when needed.

Good idea. Still mostly aspirational.

What we actually built, under names like AI skills and agent skills, is stranger. It is static enough to drift, dynamic enough to surprise you, and expensive enough that everyone eventually starts pretending the plumbing is the product.

The Thing We Wanted

The goal is reasonable. Agents need to do things, not just produce fluent explanations of things that could be done by someone else later.

To act reliably, an agent needs to know what tools exist, what those tools do, how to call them, what they return, and when using them would be a bad idea. That last part matters more than people want to admit. An agent that confidently uses the wrong tool is not autonomous. It is just fast at making someone else's afternoon worse.

There are two coherent ways to build this.

The dynamic version: the agent reasons over live capability metadata, retrieves what applies, checks current system state, and composes the action path from fresh evidence. Context stays small because irrelevant things never enter the room.

The static version: you define the agent's operating surface in advance. It is less flexible, but predictable. You know what it knows. You know what it is allowed to touch. You can audit it without needing a belief system about embeddings.

Both models make sense.

We built the middle.

The Middle We Actually Built

Skills are usually hand-authored documents. Someone writes a description, usage rules, parameters, failure notes, examples, maybe a few warnings about when not to use the thing. That document is static. It represents what one person believed at the time they wrote it, assuming they were correct, awake, and not losing a fight with the markdown table.

Then we embed those documents and retrieve them dynamically. Vector search decides which static descriptions get injected into the agent's context. The agent uses the injected text to decide what to do next.
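To make the mechanism concrete, here is a minimal sketch of that loop. Everything in it is an assumption for illustration: the skill names and descriptions are invented, and `embed` is a toy word-count vector standing in for a real embedding model. The shape is the point: static text goes in, a similarity score picks a winner, and the winner is pasted into the context.

```python
import math
import re
from collections import Counter

# Hypothetical skill library: hand-written descriptions keyed by skill name.
SKILLS = {
    "restart_service": "Restart a running service on a host after a failed deploy.",
    "rotate_api_key": "Rotate an API key and update the stored secret.",
    "query_metrics": "Query latency and error rate metrics for a service.",
}


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, k: int = 1) -> list[tuple[str, float]]:
    """Rank every skill description against the query and keep the top k."""
    q = embed(query)
    scored = [(name, cosine(q, embed(desc))) for name, desc in SKILLS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def build_context(query: str) -> str:
    """Inject the best-scoring static description ahead of the task."""
    name, _score = retrieve(query, k=1)[0]
    return f"[skill: {name}]\n{SKILLS[name]}\n\n[task]\n{query}"


print(build_context("restart the service after a failed deploy"))
```

Nothing in this path checks whether the description is still true. The only gate between a document and the agent's context is a number.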

So we get the maintenance burden of a hand-written knowledge base and the nondeterminism of semantic retrieval.

Beautiful. Like paying rent on two apartments and sleeping in the hallway.

The awkward part is not that this never works. It often works well enough. That is why it spreads.

The awkward part is that it fails in exactly the places where confidence matters.

Similar Is Not Correct

Semantic retrieval gives you candidates. It does not give you truth.

A skill description can sound relevant and still be wrong. It can be accurate when written and stale by Thursday. It can describe the happy path while the real system has grown flags, profiles, permissions, rate limits, and one undocumented behavior everyone depends on because production is a generous archivist of bad decisions.

Embedding similarity does not know any of that. It knows proximity. Proximity is useful. It is not correctness.
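One way to act on that distinction is to treat retrieval output as a list of candidates and check each one against something live before it reaches the agent. The sketch below assumes a hypothetical registry that reports the current interface version of each tool, and assumes each skill records the version it was written against. Neither exists in most setups today; the names and scores are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    skill: str         # name of the retrieved skill document
    tool: str          # tool the skill describes
    tool_version: int  # interface version the description was written against
    score: float       # similarity score from retrieval


# Hypothetical live registry: tool name -> current interface version.
LIVE_TOOL_VERSIONS = {"deploy": 3, "rollback": 1}


def verify(candidates, live_versions):
    """Split candidates into usable and rejected, keeping similarity order."""
    usable, rejected = [], []
    for c in sorted(candidates, key=lambda c: c.score, reverse=True):
        current = live_versions.get(c.tool)
        if current is None:
            rejected.append((c, "tool no longer exists"))
        elif current != c.tool_version:
            rejected.append((c, f"written for v{c.tool_version}, live is v{current}"))
        else:
            usable.append(c)
    return usable, rejected


candidates = [
    Candidate("deploy_legacy", "deploy", 2, 0.91),   # closest match, but stale
    Candidate("deploy_current", "deploy", 3, 0.84),  # lower score, still valid
    Candidate("purge_cache", "cache_purge", 1, 0.40),
]

usable, rejected = verify(candidates, LIVE_TOOL_VERSIONS)
print([c.skill for c in usable])
for c, reason in rejected:
    print(f"rejected {c.skill}: {reason}")
```

In this made-up data the highest-scoring candidate is the stale one. A version check is a crude proxy for correctness, but it is a check against the system rather than against the prose.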

This distinction matters because agents are unusually sensitive to wrong context. If a human reads a stale runbook, they might notice the command looks suspicious and ask around. Maybe. If an agent loads a stale skill that looks authoritative, it may simply proceed. Now the failure mode is not obvious refusal. It is plausible action with bad premises.

That is the expensive failure mode. The dashboard may even look fine for a while, because plausible wrongness is annoyingly photogenic.

Knowledge Engineering, Wearing New Shoes

Maintaining a skill library is knowledge engineering. That is not an insult. It is a diagnosis.

You decide what belongs in one skill versus three. You write descriptions that need to work for humans and embeddings, which are overlapping audiences in the same way a map and a weather report are both rectangles. You keep examples current. You document edge cases. You remove old guidance before it becomes a fossil with commit access.

This is expert systems work with better models around it. The tooling changed. The old problem did not politely leave the room.

And the reason we are doing it is revealing: models are not yet reliable enough to infer safe action directly from raw interfaces, schemas, docs, and live state. So we build scaffolding. We give the model a curated shortcut through the world.

Then the shortcut becomes infrastructure. Then the infrastructure becomes policy. Then someone asks why the agent ignored the new API behavior, and the answer is that the semantic database was built by hand and nobody updated the sign.

What to Build Anyway

None of this means skills are useless. They are useful. They are often the least bad way to give an agent operational shape.

But they need to be treated like contracts, not vibes.

Descriptions should be small, current, and testable. The skill should say when not to use it. Examples should exercise real invocation paths, not imaginary happy paths. If a tool changes, the skill should fail review until the description changes with it. If retrieval selects the wrong skill, that should be observable, not a shrug buried in a trace no one reads unless the demo catches fire.
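A contract can be checked by a machine. As a rough sketch of what "fail review until the description changes" might look like, the check below compares a skill record against the signature of the tool it describes, using Python's standard `inspect` module. The tool, the skill record, and its field names (`parameters`, `do_not_use_when`, `example`) are all hypothetical.

```python
import inspect


def restart_service(name: str, *, force: bool = False, timeout_s: int = 30) -> str:
    """Hypothetical tool that the skill below claims to describe."""
    return f"restarted {name}"


# Hypothetical skill record. It predates the timeout_s parameter.
SKILL = {
    "name": "restart_service",
    "parameters": ["name", "force"],
    "do_not_use_when": ["the service is mid-migration"],
    "example": {"name": "billing", "force": False},
}


def contract_problems(skill: dict, tool) -> list[str]:
    """Return every way the skill record disagrees with the tool signature."""
    problems = []
    sig = inspect.signature(tool)
    actual = list(sig.parameters)
    documented = skill["parameters"]

    for p in actual:
        if p not in documented:
            problems.append(f"undocumented parameter: {p}")
    for p in documented:
        if p not in actual:
            problems.append(f"documented parameter no longer exists: {p}")

    if not skill.get("do_not_use_when"):
        problems.append("missing 'do_not_use_when' guidance")

    # The example must at least bind to the real signature.
    try:
        sig.bind(**skill["example"])
    except TypeError as exc:
        problems.append(f"example does not bind to the tool signature: {exc}")

    return problems


print(contract_problems(SKILL, restart_service))
```

Binding the example only proves the call has the right shape, not that it does the right thing. Exercising a real invocation path would mean running it against a sandbox. Still, a check like this, run in CI, turns "someone forgot to update the skill" from a production surprise into a failed build.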

The more a skill can cause real side effects, the more boring it should be. Clear name. Clear scope. Clear stop conditions. No heroic prose. No marketing fog. No pretending that fuzzy lookup has become understanding.
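Boring can be enforced, at least partly. The lint below is a sketch under loud assumptions: the `Skill` fields, the word limit, and the list of fog words are all invented for illustration, and a real team would pick its own. It applies the strict rules only to skills that declare side effects.

```python
import re
from dataclasses import dataclass, field

# Assumed thresholds, chosen arbitrarily for this sketch.
MAX_WORDS = 60
FOG_WORDS = {"seamlessly", "powerful", "magically", "effortlessly"}


@dataclass
class Skill:
    name: str
    scope: str
    description: str
    side_effects: bool
    stop_conditions: list[str] = field(default_factory=list)


def lint(skill: Skill) -> list[str]:
    """Return findings for a skill; read-only skills are not held to these rules."""
    if not skill.side_effects:
        return []

    findings = []
    words = re.findall(r"[a-z']+", skill.description.lower())

    if not skill.scope.strip():
        findings.append("no scope declared")
    if not skill.stop_conditions:
        findings.append("no stop conditions")
    if len(words) > MAX_WORDS:
        findings.append(f"description is {len(words)} words, limit is {MAX_WORDS}")

    fog = sorted(FOG_WORDS.intersection(words))
    if fog:
        findings.append("marketing words: " + ", ".join(fog))

    return findings


risky = Skill(
    name="drop_table",
    scope="",
    description="Seamlessly and magically drops any table.",
    side_effects=True,
)
for finding in lint(risky):
    print(finding)
```

A lint like this cannot tell whether the stop conditions are the right ones. It can only make their absence impossible to miss.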

The boring version is better because boring things are easier to distrust correctly.

Where This Leaves Us

The current skill model is a pragmatic compromise. It ships. It helps. It also carries costs that are easy to underestimate: authoring overhead, retrieval uncertainty, stale guidance, hidden coupling, and a constant temptation to call a prompt library an operating system.

The future probably needs better grounding, not just better retrieval. Models need to reason over live system state, schemas, permissions, docs, examples, and consequences with less hand-curated mediation. That is a capability problem. We are not there yet.

Until then, we are building semantic databases by hand. That can be the right tradeoff. It is just worth naming honestly.

An AI skill is not really a skill. It is a well-organized prompt artifact with a fuzzy lookup and, if you are careful, a few guardrails around the sharp parts.

Useful? Yes.

Magic? No.

We are in the scaffolding phase. The scaffolding is load-bearing. Build it like it will be there for a while, because it will.