Your Claude Code Setup Is Infrastructure—Treat It Like One
Here’s an uncomfortable truth: that system of custom agents, skills, commands, and rules you’re building in Claude Code? It’s infrastructure. Not scaffolding. Not convenience tooling. Infrastructure.
The kind that breaks silently when dependencies shift. That cascades failures across subsystems. That resists debugging when things go sideways.
Most people don’t treat it that way. They should.
The Fragility Isn’t Where You Think It Is
Traditional infrastructure is deterministic. Configure a database schema—same query, same result. Define a deployment pipeline—same trigger, same sequence. Predictability is the point.
AI artifacts are probabilistic. Even with temperature set to zero, the same prompt can produce different outputs—GPU kernel non-determinism, floating-point precision drift. Instructions get interpreted differently across runs. And when one agent hallucinates? In one multi-agent security study, a single compromised agent poisoned 87% of downstream decision-making within four hours.
This creates a tension: you’re building infrastructure with materials that don’t behave like infrastructure materials. And the more interconnected your artifacts become—agents calling skills calling other agents, commands orchestrating multi-step pipelines, rules governing execution flows—the more this non-determinism compounds. The uncertainty doesn’t add linearly; it multiplies.
The fragility isn’t in the individual artifact. It’s in the dependencies between them.
What “AI Drift” Actually Looks Like
You update a skill’s instructions to improve clarity. The skill works perfectly in isolation. But now an agent that depends on that skill starts producing subtly different outputs. Not broken—just different. And three other artifacts that depend on that agent’s output now drift too.
This isn’t a bug. It’s the nature of the system. Research on prompt sensitivity has found accuracy drops of up to 76 points from minor phrasing variations. The agent interprets the new phrasing slightly differently. Downstream consumers adjust. The whole house of cards shifts.
The problem compounds when you use AI to write the artifacts themselves. Claude excels at generating agent instructions, skill definitions, orchestration logic. It also excels at introducing subtle drift—phrasing that’s technically correct but contextually different. Instructions that work but shift the implicit contract between components.
I’ve watched this play out. You ask Claude to improve an agent’s instructions. It does. The agent works. But two other agents that depend on it start behaving differently—not wrong, just different enough that their consumers notice. You’re debugging a cascade that started with a synonym.
The Infrastructure Mindset, Applied
If these artifacts are infrastructure, treat them like infrastructure:
Version control everything. MLOps best practices mandate Git-based workflows for all artifacts—not just code, but every agent definition, every skill instruction, every command file, every rule configuration. Commit with meaningful messages. Tag stable releases. Know what changed when things break.
Test before you deploy. Run the agent in isolation. Verify the skill produces expected outputs. Confirm the command orchestrates correctly. Don’t assume that because Claude generated it cleanly, it works correctly. And definitely don’t assume that because it works in isolation, it works in context.
Monitor for drift. This is harder with AI artifacts than with traditional infrastructure, but it’s more important. LLM observability has become a foundational requirement—you need visibility into what your agents are actually doing, not just what you think they’re doing based on their instructions. Log inputs and outputs. Track when behavior changes. Notice when downstream consumers start producing different results.
Change management matters. LLM governance frameworks emphasize structured pipelines with approval gates—don’t update production artifacts directly. Test changes in a branch. Review diffs before merging. Understand the dependency graph—what calls this agent, what does this skill depend on, which commands will be affected.
Document the contracts. What does this agent promise to do? What inputs does this skill expect? What outputs does this command guarantee? Write it down. When drift happens—and it will—you need to know what the original contract was.
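One way to make a contract checkable rather than just written down: record the promised output fields in a small data structure and validate real outputs against it. A minimal sketch—the artifact name and field names below are invented for illustration, not part of any real setup:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArtifactContract:
    """What an artifact promises. Hypothetical structure, not a Claude Code API."""
    name: str
    expects: tuple[str, ...]   # input fields the artifact requires
    promises: tuple[str, ...]  # output fields downstream consumers rely on

    def violations(self, output: dict) -> list[str]:
        """Return the promised fields missing from an actual output."""
        return [f for f in self.promises if f not in output]

contract = ArtifactContract(
    name="summarize-pr",                  # illustrative artifact name
    expects=("diff", "title"),
    promises=("summary", "risk_level"),
)

# A drifted artifact that silently renamed a field now fails visibly:
print(contract.violations({"summary": "...", "risk": "low"}))  # ['risk_level']
```

When drift happens, comparing outputs against the recorded contract tells you whether the implicit agreement between components actually broke, or just shifted within tolerance.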
This probably sounds like overkill for “just some prompt engineering.” It’s not. The moment you start building systems that depend on each other, you’re doing infrastructure work. The materials are different. The rigor shouldn’t be.
The Non-Determinism Tax
Here’s what makes this harder than traditional infrastructure: you can’t eliminate drift. Generative AI is probabilistic by nature. Even at temperature zero, outputs vary. Interpretations shift. Context matters in ways you can’t fully predict.
Testing can’t just be “does it work”—it has to be “does it work consistently enough.” Monitoring can’t just be “did it fail”—it has to be “did behavior change in ways that matter.” Change management can’t just be “what did we change”—it has to be “how did the interpretation shift.”
This is the non-determinism tax. You pay it in testing time, monitoring complexity, and debugging difficulty. But you pay it either way—upfront through rigor, or downstream through cascading failures.
I think most teams are choosing downstream without realizing they’re choosing at all.
What This Looks Like in Practice
Start with a dependency graph. Not a flowchart—an actual graph of what calls what. Which agents invoke which skills. Which commands orchestrate which agents. Which rules govern which execution paths. Draw it. Update it when things change. Refer to it when things break.
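Once the graph exists as data, "blast radius" stops being a guess. A sketch of the idea—the artifact names and graph shape here are made up, and edges point from each artifact to its consumers:

```python
from collections import deque

# Illustrative dependency graph: artifact -> the artifacts that consume it.
CONSUMERS = {
    "skill:extract-diff": ["agent:reviewer"],
    "agent:reviewer": ["command:review-pr", "agent:release-notes"],
    "agent:release-notes": ["command:ship"],
    "command:review-pr": [],
    "command:ship": [],
}

def blast_radius(artifact: str) -> set[str]:
    """Everything downstream of `artifact` — i.e., what to retest after a change."""
    seen, queue = set(), deque([artifact])
    while queue:
        for consumer in CONSUMERS.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

# Changing one skill touches everything downstream of it:
print(sorted(blast_radius("skill:extract-diff")))
```

Even a dictionary this crude answers the question that matters before every change: if I touch this, what else moves?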
Version control: Every artifact in a git repo, every change in a meaningful commit, every stable configuration tagged as a release. When you update an agent’s instructions, the diff shows exactly what changed in natural language. When something breaks, you can roll back to a known-good state.
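A cheap pre-flight check for the version-control discipline: confirm every artifact on disk is actually known to git. In practice the tracked list would come from `git ls-files` and the on-disk list from walking your artifacts directory; both lists below are illustrative stand-ins:

```python
def untracked_artifacts(on_disk: list[str], tracked: list[str]) -> list[str]:
    """Artifact files git doesn't know about — invisible to diff and rollback."""
    return sorted(set(on_disk) - set(tracked))

# Stand-in data; real paths depend on your repo layout.
on_disk = ["agents/reviewer.md", "skills/extract-diff.md", "commands/ship.md"]
tracked = ["agents/reviewer.md", "skills/extract-diff.md"]

print(untracked_artifacts(on_disk, tracked))  # commands/ship.md has no history
```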
Testing: Run the artifact in isolation with representative inputs, verify outputs match expectations, then test in context with real dependencies. If it’s an agent, run it with the actual skills it calls. If it’s a command, execute the full orchestration pipeline. If it’s a rule, verify it governs behavior correctly across scenarios.
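Because the materials are probabilistic, "expected output" has to mean "consistent enough across runs," not "byte-identical." One way to sketch that check—string similarity via `difflib` is a crude proxy; swap in exact matching, field-level checks, or embeddings depending on the artifact:

```python
from difflib import SequenceMatcher

def consistent_enough(outputs: list[str], threshold: float = 0.9) -> bool:
    """True if every run's output is similar enough to the first run."""
    baseline = outputs[0]
    return all(
        SequenceMatcher(None, baseline, o).ratio() >= threshold
        for o in outputs[1:]
    )

# Stand-in for N runs of the same artifact on the same input:
runs = [
    "Risk: low. Two files changed, tests updated.",
    "Risk: low. Two files changed; tests updated.",
    "Risk: low. Two files changed, tests updated.",
]
print(consistent_enough(runs))
```

The threshold is a judgment call per artifact: a classification output might demand exact agreement, while a prose summary can tolerate more variance.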
Monitoring: Log every agent invocation with inputs and outputs, track behavior changes over time, alert when outputs drift beyond acceptable thresholds. You’re not monitoring for failures—you’re monitoring for subtle shifts that indicate drift is happening.
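Logging plus a drift threshold can be surprisingly little code. A sketch, assuming you can capture each invocation's inputs and outputs—the log shape and the 0.8 threshold are illustrative, and real setups might compare structured fields or embeddings rather than raw strings:

```python
import time
from difflib import SequenceMatcher

class DriftMonitor:
    """Hypothetical invocation log with a similarity-based drift alert."""

    def __init__(self, baseline_output: str, threshold: float = 0.8):
        self.baseline = baseline_output
        self.threshold = threshold
        self.log: list[dict] = []

    def record(self, inputs: str, output: str) -> bool:
        """Log the invocation; return True if output drifted past the threshold."""
        score = SequenceMatcher(None, self.baseline, output).ratio()
        self.log.append({"ts": time.time(), "inputs": inputs,
                         "output": output, "similarity": round(score, 3)})
        return score < self.threshold

monitor = DriftMonitor(baseline_output="Risk: low. Two files changed.")
drifted = monitor.record("pr-123", "This PR refactors the entire auth layer.")
print(drifted)  # far from baseline -> flagged as drift
```

The point isn't the similarity metric—it's that every invocation leaves a record, so "behavior changed last Tuesday" becomes something you can look up instead of reconstruct.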
Change management: Branch for changes, test thoroughly, review with awareness of dependencies, merge only when you understand the full impact. If you’re using AI to generate or modify artifacts, review even more carefully—hallucinations are subtle and propagate insidiously.
Why This Matters Now
Claude Code is early. Most teams are experimenting. Artifact ecosystems are small—a few agents, a handful of skills, some custom commands.
But this compounds fast. Every new agent increases the surface area for drift. Every new dependency creates another potential cascade. Every AI-generated update introduces another opportunity for subtle hallucination to propagate.
The teams that build infrastructure discipline now—when the systems are small—will scale successfully. The teams that treat this as “just automation scripts” will hit a complexity wall where debugging becomes impossible and changes become too risky to make.
The constraint isn’t Claude’s capability. It’s the fragility of interconnected non-deterministic systems—and the discipline required to manage them.
Where to Start
If you’re running Claude Code with custom artifacts, start here:
Version control check: Is every artifact in version control? Can you see what changed in the last month? Can you roll back to last week’s configuration?
Dependency map: Do you know what calls what? Can you predict the blast radius of changing an agent’s instructions?
Testing protocol: Do you test artifacts in isolation before deploying them? Do you verify behavior in context after integration?
Change discipline: Do you review AI-generated artifact changes the same way you’d review AI-generated code? Do you understand what shifted when the phrasing changed?
If any of those answers is “not really,” you’re treating infrastructure as scaffolding. It works until it doesn’t—and when it doesn’t, the debugging is exponentially harder.
The gap between “AI tooling” and “core infrastructure” is narrower than you think. The moment agents call agents, skills depend on context, commands orchestrate pipelines—you’re building infrastructure.
The materials are probabilistic. The discipline should be infrastructure-grade.
I want to hear what you’re seeing. How are you managing drift? What does your testing look like? Where does fragility show up first?
The implementations matter—and they should be shared.

Treating it as infrastructure is the right mindset. What took me a while to sort out, though, was what exactly you're version controlling. Commands, agents, and skills in OpenCode each do different things, and lumping them together makes the setup harder to reason about: commands are prompt shortcuts, agents carry the permission model and system prompt, and skills supply domain context on demand. I mapped out the differences here: https://blog.devgenius.io/no-commands-skills-and-agents-in-opencode-whats-the-difference-cf16c950b592