Vibe coding with verification

Vaibify’s tagline is “Vibe boldly. Verify everything.” This document is about the second half. It describes a methodology for writing agent-facing documentation so that an AI coding agent can contribute safely to a scientific software repository, and so that the documentation itself never drifts away from the code it describes.

The principles are repository-agnostic. Vaibify is used as a worked example because it is a mid-sized Python and JavaScript codebase with more than two thousand unit tests, frequent refactors, and several collaborators of mixed software backgrounds.

This is a methodology guide, not a quick reference. Expect to read it in one sitting, then come back to individual sections when you are setting up a new repository.

1. Motivation: why agent docs drift

AI coding agents are stochastic. The same prompt, run twice, can produce different code, different file names, and different imports. This is not a defect of any particular model; it is a property of how these systems sample from a distribution of plausible completions.

Scientific software tolerates stochasticity poorly. A simulation that gives a slightly different answer each time it runs is not a simulation; it is a hazard. The same is true for a codebase: if an agent silently rewrites a helper function the wrong way on Tuesday, because the prompt that worked on Monday happened to produce a different result, a human reviewer may not catch it until a test fails months later. The most dangerous mistakes are the ones that are plausible.

The natural response is to write more documentation. Tell the agent where everything lives, what the modules are called, how many files are in each package, what the imports look like. This works for one afternoon. Then somebody refactors a module, or adds a new route, or splits a file in two, and the documentation quietly becomes wrong.

Vaibify’s previous CLAUDE.md hard-coded line counts for more than thirty-five Python modules. Within a few weeks the counts were off by ten to forty percent, and a newly added scriptReposPanel.js module was missing entirely from the map. An agent reading that file had two bad choices: trust the stale data and generate confidently wrong code, or ignore the map and re-derive the architecture from scratch every session, producing inconsistent results. Neither is acceptable in a scientific setting where reproducibility is a primary value.

The root cause is that the documentation was trying to do two jobs at once: it was stating rules that cannot be tested, and it was reciting facts that should never have been typed by hand. Untangling those two jobs is the core of the methodology.

2. Deterministic versus stochastic documentation

There are two fundamentally different kinds of content in any architecture document.

Deterministic signals are facts that the code unambiguously is: the list of modules in a package, the symbols each module exports, the type of an argument, the presence or absence of a test, the result of running a linter. Machines extract these reliably. They cannot drift relative to the code because they are derived from the code. If you write them down by hand, you are creating a second source of truth that is guaranteed to diverge.

Stochastic signals are rules, contracts, intents, hazards, and invariants that span multiple files. They cannot be extracted from any single file or from any mechanical scan. They have to be written by humans who know the system. They are load-bearing precisely because an agent cannot infer them from reading the code.

Examples of stochastic signals in vaibify:

  • “Container paths use posixpath; host paths use os.path.”

  • “Never reassign setExpandedSteps; mutate it in place.”

  • “director.py intentionally duplicates two functions from workflowManager.py because it operates on host filesystem paths.”

None of those rules can be seen by reading any one file. All three have caused real bugs when an agent or a new developer ignored them.
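
The first rule can be illustrated with the standard library alone. The sketch below is not taken from vaibify; the function names and sample paths are hypothetical. It uses ntpath, the Windows flavor of os.path, to show what os.path would produce on a Windows host, so the difference is visible on any machine.

```python
import ntpath
import posixpath


def container_path(*parts):
    """Join parts for use inside a Linux container (separator is always '/')."""
    return posixpath.join(*parts)


def windows_host_path(*parts):
    """Join parts the way os.path.join would on a Windows host."""
    return ntpath.join(*parts)


print(container_path("/workspace", "data", "out.csv"))    # /workspace/data/out.csv
print(windows_host_path("C:\\work", "data", "out.csv"))   # C:\work\data\out.csv
```

Nothing in either function body signals which kind of path the caller needs, which is why the rule has to be stated in prose.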

A documentation system that mixes these two categories in one hand-written file gets the worst of both worlds. The deterministic parts drift, training the reader to distrust the file. The stochastic parts, which are the whole point, get buried under rotting module maps and stale line counts.

The analogy to scientific computing is almost too on-the-nose. Deterministic components of a physical model (conservation laws, boundary conditions, unit conversions) are handled by code and checked by tests. Stochastic components (priors, parameter ranges, stopping criteria) are handled by the researcher. You do not ask your simulation to invent its own priors and you should not ask your documentation to invent its own module map.

3. The four-layer framework

Vaibify’s agent documentation is organized into four layers. Each layer has a different trigger (when the agent sees it), a different source of truth, and a different failure mode. Keeping them separate is what makes the whole system stable.

Layer 1: always-on, semantic

Short prose files stating rules that cannot be tested. These live at AGENTS.md in the repository root, with nested AGENTS.md files in subtrees that have their own conventions (for example, vaibify/gui/AGENTS.md for the FastAPI backend and vaibify/gui/static/AGENTS.md for the JavaScript frontend).

These files are loaded every turn of an agent session. They should contain only content that is load-bearing and cannot be expressed as a test: style contracts, the handful of cross-cutting rules a newcomer would miss, the traps listed in section 5.

Failure mode: silent. When a Layer 1 file is wrong, the agent trusts it and produces subtly wrong code. This is why Layer 1 should stay small.

Layer 2: enforced, deterministic

Architectural invariants expressed as pytest assertions. A single file like tests/testArchitecturalInvariants.py contains tests that assert things like “no module under vaibify/gui/routes/ imports from pipelineServer” or “every route module defines fnRegisterAll” or “no JavaScript file in static/ exceeds two thousand lines”.

These cannot drift. When the rule changes, the test changes in the same commit. The test name, docstring, and assertion together are the rule, which means the rule is self-documenting and executable at the same time.

Failure mode: loud. When a Layer 2 invariant breaks, CI turns red. This is exactly what you want.
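
As a rough illustration of what such an assertion might look like, here is a sketch of the “every route module defines fnRegisterAll” invariant. The helper names, the sample module names, and the fnRegisterAll(app) signature are assumptions made for the example; the in-memory sources stand in for files that a real test would read from the route package.

```python
import ast


def defines_function(source, name):
    """Return True if the module source defines a top-level function `name`."""
    tree = ast.parse(source)
    return any(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name == name
        for node in tree.body
    )


def find_modules_missing(sources, name):
    """Return sorted module names whose source lacks top-level function `name`."""
    return sorted(
        module for module, source in sources.items()
        if not defines_function(source, name)
    )


def test_every_route_module_defines_fnRegisterAll():
    """Every route module must define fnRegisterAll."""
    # Hypothetical stand-ins: a real test would build this dict by reading
    # each .py file under the route package from disk.
    sources = {
        "stepRoutes": "def fnRegisterAll(app):\n    pass\n",
        "fileRoutes": "def fnRegisterAll(app):\n    pass\n",
    }
    assert find_modules_missing(sources, "fnRegisterAll") == []
```

The test name and docstring state the rule, and the assertion enforces it, so there is no separate sentence left to go stale.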

Layer 3: on-demand, deterministic

Discovery scripts that extract structural facts from the current code. Vaibify ships tools/listModules.py, which walks the package with Python’s ast module and prints the current module map: path, public symbols (from __all__), and a one-line purpose from each module’s docstring. Line counts and import edges are deliberately excluded as drift bait; if an agent needs either, it should run a targeted tool at the moment of need rather than read a persisted summary. An agent runs this script when it needs the current state of the codebase.

Nothing is persisted. The output is regenerated on every invocation from the live source tree, so it cannot be stale. Other examples include a script that lists all pytest markers, a script that prints the route graph, or a grep that enumerates TODOs.

Failure mode: rare. When a discovery script is wrong, it fails at the moment of use and the agent notices immediately.
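
The core of such a discovery script is small. The following is a minimal sketch of the per-module extraction step, assuming __all__ is assigned a literal list at module level; it is not the actual tools/listModules.py, and describe_module and the sample source are hypothetical.

```python
import ast


def describe_module(source):
    """Return (public symbols from __all__, first docstring line) for a module."""
    tree = ast.parse(source)
    docstring = ast.get_docstring(tree) or ""
    purpose = docstring.splitlines()[0] if docstring else ""
    symbols = []
    for node in tree.body:
        if not isinstance(node, ast.Assign):
            continue
        is_all = any(
            isinstance(target, ast.Name) and target.id == "__all__"
            for target in node.targets
        )
        if is_all:
            # literal_eval accepts only plain literals, so a computed
            # __all__ raises ValueError instead of being silently guessed.
            symbols = list(ast.literal_eval(node.value))
    return symbols, purpose


SAMPLE = '''"""Load tabular data files.

Longer description that the map omits.
"""
__all__ = ["fnLoadCsv", "fnLoadJson"]
'''

print(describe_module(SAMPLE))
# (['fnLoadCsv', 'fnLoadJson'], 'Load tabular data files.')
```

A full script would loop this over every .py file in the package and print one line per module. Because the source is parsed rather than imported, the script has no side effects and does not need the package’s dependencies installed.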

Layer 4: conditional, semantic

Multi-step recipes for recurring tasks, loaded only when the task matches. Anthropic Skills (.claude/skills/*/SKILL.md), Cursor rules, and similar tool-specific formats belong here. A skill might encode the steps for adding a new module, reviewing a pull request, or running a security audit.

Failure mode: silent, like Layer 1, because skills are prose. Use sparingly and only for genuinely recurring multi-step work. A skill that fires on every task is just Layer 1 wearing a costume.

Summary

  Layer   Trigger        Source of truth        Failure mode
  1       Every turn     Hand-written prose     Silent
  2       Every commit   The assertion itself   Loud (CI)
  3       On demand      The live source tree   At point of use
  4       Task match     Hand-written prose     Silent

The important property is that Layers 2 and 3, which cannot drift, carry most of the deterministic content, while Layers 1 and 4, which can drift, carry only the irreducibly stochastic content.

4. The scoping test

When you are tempted to add content to an agent-facing document, run it through four questions in order.

  1. Would a new developer, reading the code alone for twenty minutes, miss this? If the answer is no, do not write it at all. A document full of content that a careful reader could have inferred trains its readers (human and agent) to skim.

  2. Can the rule be expressed as an assertion on the code? If yes, it belongs in Layer 2. Promote aggressively. Tests are the only documentation artifact that cannot lie.

  3. Is the fact extractable from the code? If yes, it belongs in Layer 3. Write a script, do not persist the output. If you catch yourself typing a module list or a line count, stop.

  4. Otherwise, is it a single-step rule or a multi-step recipe? Single-step rules go in Layer 1. Multi-step recurring recipes go in Layer 4.

The common failure is to skip straight to Layer 1 for everything, because prose is easy to write and tests are hard. Resist this. The cost of writing a test once is far less than the cost of a stale paragraph misleading an agent every day for a year.
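
The four questions form a strict decision cascade, which can be written down as a small function. This is only a restatement of the scoping test in code; choose_layer and its parameter names are invented for the illustration.

```python
def choose_layer(missed_by_careful_reader, testable, extractable, multi_step):
    """Apply the four scoping questions in order and name the destination."""
    if not missed_by_careful_reader:
        return "omit"      # question 1: a careful reader would infer it anyway
    if testable:
        return "Layer 2"   # question 2: promote to an assertion
    if extractable:
        return "Layer 3"   # question 3: write a script, persist nothing
    # question 4: single-step rule versus multi-step recurring recipe
    return "Layer 4" if multi_step else "Layer 1"


# "No route module imports from pipelineServer" can be asserted on the code.
print(choose_layer(True, True, False, False))    # Layer 2
# "Never reassign setExpandedSteps" is a single-step rule with no easy test.
print(choose_layer(True, False, False, False))   # Layer 1
```

The order of the checks matters: content that is both testable and expressible as prose goes to Layer 2, because the earlier question wins.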

5. Traps over rules

The highest-value prose content in an AGENTS.md is almost never a list of rules. It is a list of traps: places where the code does the opposite of what a careful reader would expect, or where two things look alike but behave differently.

Traps are what a new contributor cannot discover by reading carefully. They have to be told. A good trap entry names the two things that look alike, says which one does what, and gives the consequence of getting it wrong.

Examples from vaibify:

  • Container paths use posixpath; host paths use os.path. workflowManager.py and director.py contain similarly named functions because one operates on container paths and the other on host paths. Using the wrong one silently produces wrong file paths on Windows, or on any host where the separator differs.

  • director.py looks like a CLI helper but is actually a parallel workflow runner with its own variable resolution. Do not unify it with workflowManager.py without understanding why the two resolve variables differently.

  • _dictUiState contains several Set objects. These sets are captured by reference in the render closure. Reassigning a set (_dictUiState.setExpandedSteps = new Set()) silently breaks the render, which continues to read the old set. Always call .clear() instead.

  • introspectionScript.py duplicates format-handling logic from dataLoaders.py. This is not a refactoring opportunity; the introspection script runs inside Docker containers that cannot import from the host environment.
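
The _dictUiState trap is a JavaScript one, but the mechanism (a closure holding a reference to a set object) behaves the same way in Python. The sketch below is a Python analogue with hypothetical names, not vaibify’s frontend code.

```python
def make_render(expanded_steps):
    """Return a render function that closes over the given set object."""
    def render():
        return sorted(expanded_steps)
    return render


def collapse_by_reassigning(ui_state):
    """Wrong: rebinds the key to a new set; the closure keeps the old one."""
    ui_state["expanded_steps"] = set()


def collapse_in_place(ui_state):
    """Right: empties the same set object the closure already holds."""
    ui_state["expanded_steps"].clear()


def rendered_after(collapse):
    """Build fresh state, capture it in a render closure, collapse, render."""
    ui_state = {"expanded_steps": {"step-1", "step-2"}}
    render = make_render(ui_state["expanded_steps"])
    collapse(ui_state)
    return render()


print(rendered_after(collapse_by_reassigning))  # ['step-1', 'step-2']
print(rendered_after(collapse_in_place))        # []
```

Neither version raises an error, which is what makes this a trap rather than a bug that announces itself.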

The exercise, when starting an AGENTS.md, is to ask: what are the five mistakes I would be most annoyed to see an agent make in this repository next week? Write those. Everything else can wait.

6. The feedback principle

When an agent makes the same mistake twice, the documentation system should absorb the lesson. There are three ways to do that, in order of preference.

First, promote the mistake into a Layer 2 test. If the mistake is “an agent keeps importing from pipelineServer in a route module,” add an architectural invariant that fails when any route module imports from pipelineServer. The mistake becomes a permanent guard rail; no future agent can repeat it without turning CI red.

Second, if the mistake cannot be tested, add it to the Lessons section of the relevant AGENTS.md. One paragraph per lesson. State the mistake, state the correct behavior, state the consequence of getting it wrong.

Third, if the mistake is a multi-step task being done inconsistently, add a Layer 4 skill.

Without this loop, documentation stagnates at initial quality. The test-promotion path is the most valuable because it converts stochastic prose (which can be ignored) into deterministic enforcement (which cannot). Over time, a repository that uses the feedback loop well will find that most of its Layer 1 content slowly migrates into Layer 2 tests, leaving behind only the rules that genuinely resist mechanization.
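
To make the test-promotion path concrete, here is a sketch of how the pipelineServer mistake could become a guard rail. The helper names and the sample import lines are hypothetical, and a real invariant would read each route module from disk instead of using inline strings.

```python
import ast


def imported_modules(source):
    """Return the dotted names a module's source imports, without importing it."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"; use "" in that case.
            base = node.module or ""
            names.update(f"{base}.{alias.name}" for alias in node.names)
    return names


def imports_forbidden(source, forbidden):
    """Return True if any imported dotted name has `forbidden` as a component."""
    return any(
        forbidden in name.split(".") for name in imported_modules(source)
    )


def test_route_modules_do_not_import_pipelineServer():
    """No route module may import from pipelineServer."""
    # Hypothetical samples standing in for route module sources.
    assert imports_forbidden("import pipelineServer\n", "pipelineServer")
    assert imports_forbidden("from . import pipelineServer\n", "pipelineServer")
    assert not imports_forbidden("import json\nfrom os import path\n", "pipelineServer")
```

Matching on whole dotted components is deliberately conservative: it flags any imported name called pipelineServer, but not an unrelated module whose name merely starts with the same letters.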

7. Tradeoffs and limits

Honest caveats, because nothing in this space is free.

Tests cover maybe forty percent of the rules in a typical research codebase. The rest stays as prose. It is tempting to over-promise determinism and claim that the right architecture can test everything, but invariants about intent, taste, and scientific meaning resist assertion. Do not pretend otherwise.

One-line module docstrings drift less than paragraph-length ones. Keep docstrings short for routine modules; reserve longer docstrings for the modules that encode emergent semantics a reader cannot infer by reading the functions in order. fileStatusManager.py in vaibify has a long module docstring because it documents a state machine that spans five files; most modules have one-line docstrings because their behavior is visible in their function signatures.

Context-window economics matter less at frontier-model scale than they did a year ago. A two-hundred-line AGENTS.md that is correct beats a sixty-line AGENTS.md that omits load-bearing context. Optimize for correctness, not brevity. That said, every sentence you add is a future drift liability, so only add content that is load-bearing.

Skills are tool-specific. Claude Code’s skill format differs from Cursor’s rules format, and both differ from whatever the next tool will ship. If cross-tool portability matters, stick to AGENTS.md plus tests plus scripts. Skills are a convenience on top, not a foundation.

8. Practical playbook

Here is a checklist for applying this methodology to your own repository. Do the items in order.

  1. List the three to five traps you would be most annoyed to see an agent fall into. Write those first as the core of your AGENTS.md. Do not start with style rules or module maps.

  2. List the architectural invariants you rely on. For each, ask “can this be a test?” Promote the ones that can. Create a file like tests/testArchitecturalInvariants.py even if it only contains three tests on day one.

  3. Do not write a module map. Write a tools/listModules.py (or equivalent) that prints the current state on demand. If you cannot resist persisting a map somewhere, persist the script’s output rather than a hand-written version, and regenerate it on every commit via a pre-commit hook.

  4. Symlink CLAUDE.md to AGENTS.md so every agent tool reads the same file. If your tool of choice uses yet another filename, symlink that too. One source of truth per repository.

  5. Set up a CI path check that greps your AGENTS.md files for file paths and fails on dangling references. This catches the most common silent drift: a module gets moved and the doc still points at the old location.

  6. Iterate on AGENTS.md every time you catch yourself correcting an agent on the same thing twice. Apply the feedback principle: promote to a test if possible, otherwise add a Lessons entry.

  7. Resist the urge to add everything. Each new sentence is a future drift liability. Read the scoping test in section 4 before adding content.
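
The path check in item 5 can be sketched in a few lines. This is one possible shape, under stated assumptions: the regular expression only recognizes tokens that contain a slash and end in .py, .js, or .md, and find_dangling_references, PATH_PATTERN, and tools/oldName.py are names invented for the example.

```python
import re
import tempfile
from pathlib import Path

# Assumption: a "path" is a slash-separated token ending in a known suffix.
# Bare file names such as director.py are skipped because their directory
# cannot be determined from the prose alone.
PATH_PATTERN = re.compile(r"(?:[\w.-]+/)+[\w.-]+\.(?:py|js|md)\b")


def find_dangling_references(doc_text, repo_root):
    """Return sorted path-like tokens in doc_text that are missing under repo_root."""
    root = Path(repo_root)
    candidates = set(PATH_PATTERN.findall(doc_text))
    return sorted(p for p in candidates if not (root / p).exists())


def demo():
    """Check a sample sentence against a throwaway repository layout."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "tools").mkdir()
        (root / "tools" / "listModules.py").write_text("")
        doc = "Run tools/listModules.py; see also tools/oldName.py."
        return find_dangling_references(doc, root)


print(demo())  # ['tools/oldName.py']
```

In CI, a wrapper would call find_dangling_references on the text of each AGENTS.md and exit with a non-zero status if the returned list is not empty.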

9. Worked examples: vaibify as reference

The patterns in this document are visible in vaibify’s own repository. Cross-references so you can see how they look in practice:

  • AGENTS.md: the Layer 1 semantic doc at the repository root.

  • vaibify/gui/AGENTS.md and vaibify/gui/static/AGENTS.md: nested subtree docs that state conventions specific to the FastAPI backend and the JavaScript frontend.

  • tests/testArchitecturalInvariants.py: the Layer 2 file that encodes every testable invariant.

  • tools/listModules.py: the Layer 3 discovery script. Run it when you need a module map; do not paste its output anywhere.

  • .claude/skills/: Layer 4 recipes for recurring multi-step tasks.

Some of these files are being checked in alongside this document, so a reference may briefly fail to resolve. Once the full set lands, each link above points at the concrete artifact that corresponds to the abstract layer.

10. A closing thought

This methodology is not really about agents. It is about making code legible to any reader, human or machine, without sacrificing the honesty that scientific software requires. In a scientific context, documentation that drifts from code is a reproducibility hazard: future readers (including future you) use the docs to understand what the code did, and a docs-code gap quietly corrupts the record. Agent-friendly docs and reproducibility-friendly docs turn out to be the same thing. The discipline required to keep an AGENTS.md correct is the same discipline required to keep a methods section correct. Vibe boldly. Verify everything.