Vision
Vaibify is a harness for making computational science reproducible in a world where AI agents are first-class participants in the research process. This document describes what that means, the graded framework we use to reason about reproducibility, and where vaibify sits on it.
For the motivation that preceded vaibify — why secure containment and agent-assisted computing matter — see philosophy.md. For the day-to-day mechanics of the reproducibility stack, see reproducibility.md. This document is intentionally higher-level and forward-looking, so that it can be cited independently of any single vaibify release.
The problem
Two concurrent shifts make the existing reproducibility story insufficient.
Agentic contributors. AI systems now produce plausible scientific output at speeds that overwhelm peer review and without the informal trust signals (reputation, institution, career-staking) that human science relies on. A paper generated by an agent has no career at stake; its reliability has to come from the work itself, not the author. See EviBound (arXiv 2511.05524) for the architectural framing and REPRO-Bench (arXiv 2507.18901) for empirical evidence that agents cannot reliably self-assess reproducibility.
Cryptographic possibility. The building blocks for machine-verifiable scientific claims now exist in production — content-addressed storage (git), immutable archival (Zenodo DOIs, Software Heritage), supply-chain attestation (Sigstore, SLSA), deterministic container builds, and in the longer term zero-knowledge proofs of computation. None of this infrastructure was realistic for working scientists a decade ago.
The opening is therefore to treat a reproducible scientific artifact not as a PDF plus some code, but as a content-addressed chain of evidence whose integrity any third party — human or agent — can verify without rerunning the computation.
The reproducibility ladder
Reproducibility is not a binary. We distinguish five levels, each strictly harder than the last. The structure follows the pattern used in other trust-across-boundaries domains: SLSA in software supply-chain security, evidence-based medicine hierarchies, NIST metrological traceability, and art-world provenance. At each level we state what it proves and, equally important, what it does not.
L1 — Self-Consistent
All workflow tests pass. Every file declared as canonical matches the workspace’s last local commit. Test markers reproduce on a fresh clone: the recorded content hashes match what the author verified.
Proves: the author has not silently modified anything between verification and hand-off.
Does not prove: that the author’s state is consistent with anything outside the author’s own machine.
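The L1 check can be illustrated with a minimal sketch. It assumes the recorded hashes are SHA-256 digests keyed by workspace-relative path; the function names and the shape of the `recorded` mapping are hypothetical and are not vaibify's actual marker format.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(workspace: Path, recorded: dict[str, str]) -> list[str]:
    """List canonical files whose current hash differs from the recorded one.

    `recorded` (hypothetical format) maps a workspace-relative path to the
    hex digest the author verified. A missing file counts as a mismatch.
    """
    mismatches = []
    for relative_path, expected in sorted(recorded.items()):
        target = workspace / relative_path
        if not target.is_file() or sha256_of_file(target) != expected:
            mismatches.append(relative_path)
    return mismatches
```

An empty list from `find_mismatches` on a fresh clone corresponds to the L1 condition that nothing changed between verification and hand-off. It says nothing about agreement with any authority outside the clone.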
L2 — Published
Every canonical file’s content hash matches what is published at a specific, immutable remote authority: GitHub’s blob at a given commit SHA; Overleaf’s referenced revision; the Zenodo DOI’s archive for bundled data. The commit SHA is reachable on the public branch.
Proves: a third party can verify the published record matches the author’s container using only hash fetches from public authorities. No trust in the author’s environment required.
Does not prove: that the computation producing those files can be re-executed to yield the same bits.
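For the git case, a third party can recompute a blob's object ID from the file bytes alone and compare it with the ID published by the remote. The sketch below assumes a repository using git's default SHA-1 object format. The helper names are hypothetical, and how vaibify itself performs the comparison is not specified here.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the object ID git assigns to `content` stored as a blob.

    Git hashes the header "blob <size>" plus a NUL byte, followed by the
    raw file bytes.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def matches_published(local_bytes: bytes, published_blob_id: str) -> bool:
    """Compare local file bytes against a blob ID obtained from the remote."""
    return git_blob_sha1(local_bytes) == published_blob_id.lower()
```

The result agrees with `git hash-object` for the same bytes, so the check needs only the file and the published ID, not the author's environment.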
L3 — Reproducible
The Docker image is reproducibly buildable: running docker build . against the committed Dockerfile produces an identical image hash on any capable host. The workflow is deterministic: random seeds are declared, BLAS threading is pinned, and no nondeterministic libraries are used. Running vaibify reproduce <url> on a fresh clone regenerates output files whose hashes match the committed baseline.
Proves: a third party can regenerate the author’s outputs from source + inputs and get identical bits.
Does not prove: that the inputs themselves are authentic.
The jump L2 → L3 is genuinely hard. Deterministic numerics on modern hardware require care (OMP_NUM_THREADS=1 or MKL_CBWR=COMPATIBLE, avoidance of nondeterministic CUDA operations); reproducible Docker builds require pinned apt versions and SOURCE_DATE_EPOCH. Much of this tooling exists in the reproducible-builds community; vaibify’s role is to package it for working scientists.
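Why thread pinning matters for bit-identical output can be seen in a small, self-contained sketch. Floating-point addition is not associative, so a reduction split across threads can round differently from a single-threaded one. The function names are illustrative. The explicit loop is used in place of the built-in `sum`, whose handling of floats differs between Python versions.

```python
def sequential_sum(values):
    """Add floats strictly left to right, as a single thread would."""
    total = 0.0
    for value in values:
        total += value
    return total


def chunked_sum(values, chunk_size):
    """Sum fixed-size chunks separately, then combine the partial sums.

    This mimics a reduction whose work is split across threads.
    """
    partials = [
        sequential_sum(values[start:start + chunk_size])
        for start in range(0, len(values), chunk_size)
    ]
    return sequential_sum(partials)


# Values chosen so that rounding depends on the order of additions.
values = [1e16, 1.0, -1e16, 1.0]
one_thread = sequential_sum(values)
two_threads = chunked_sum(values, chunk_size=2)
same_bits = one_thread == two_threads
```

Here the two orderings disagree, so a hash of the output would differ even though both runs are "correct" to within rounding. Pinning the thread count fixes the reduction order, which is what an L3 claim needs.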
L4 — Archived
Every external input is recorded with (source URL, fetch timestamp, content hash) in a committed manifest. Re-fetching the URL produces a matching hash; archival services (Zenodo, Software Heritage, Wayback Machine) hold snapshots in case the source disappears.
Proves: the full causal chain from raw observation to published plot is tamper-evident.
Does not prove: that anyone besides the author has independently verified the re-execution.
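One possible shape for such a manifest entry is sketched below. The field names, the choice of SHA-256, and the JSON layout are assumptions made for illustration, not vaibify's manifest schema, and the network fetch is left out so the example stays self-contained.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class InputRecord:
    """One external input: where it came from, when, and what bytes it had."""
    source_url: str
    fetched_at: str  # ISO 8601 timestamp in UTC
    sha256: str


def record_input(source_url: str, payload: bytes, fetched_at: datetime) -> InputRecord:
    """Build a manifest entry from bytes that were already fetched."""
    return InputRecord(
        source_url=source_url,
        fetched_at=fetched_at.astimezone(timezone.utc).isoformat(),
        sha256=hashlib.sha256(payload).hexdigest(),
    )


def verify_refetch(record: InputRecord, refetched: bytes) -> bool:
    """Return True if re-fetched bytes hash to the recorded digest."""
    return hashlib.sha256(refetched).hexdigest() == record.sha256


def to_manifest_json(records: list[InputRecord]) -> str:
    """Serialize entries deterministically so the manifest diffs cleanly in git."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)
```

If `verify_refetch` fails against the live URL, the same check can be run against an archived snapshot, which is the role the archival services play at this level.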
L5 — Attested
Independent third parties — CI services, collaborators, archival bots — have re-run the workflow and published signed attestations that their output hashes match the committed baseline. Attestations are in a transparency log (Sigstore / Rekor style), so revocation and provenance of the attestation itself are public.
Proves: reproducibility is no longer the author’s claim alone; it is community-verified and the community verification is itself publicly auditable.
Does not prove: anything deeper. This is the ceiling of what hashes and signatures alone can guarantee.
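The matching step of L5 can be sketched as follows. Signature verification and the transparency-log lookup are deliberately omitted, since they belong to tools such as Sigstore and Rekor; the `Attestation` type and its fields are hypothetical and assume each attestation has already been verified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    """A re-run report, assumed to have passed signature verification."""
    attestor: str
    output_sha256: str


def independent_confirmations(
    baseline_sha256: str, author: str, attestations: list[Attestation]
) -> set[str]:
    """Return attestors other than the author whose output matches the baseline."""
    return {
        item.attestor
        for item in attestations
        if item.attestor != author and item.output_sha256 == baseline_sha256
    }
```

A non-empty result means at least one party other than the author reproduced the baseline hash. How many confirmations a community should require is a policy question this sketch does not address.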
What this framework is and is not
The framework is a vocabulary for stating, and checking, the reproducibility rigor of a computational result. It is strictly independent of vaibify. Any tool that implements container packaging, external hash fetches, reproducible builds, and attestation logs could claim levels on it. Vaibify is an implementation, not the implementation.
Two properties we deliberately keep off the ladder:
Test meaningfulness. Whether a workflow’s tests are actually strong enough to catch subtle errors (mutation-testing coverage, adversarial robustness) is a property a workflow should have at any level. It is a property of the tests, not of reproducibility.
Physics-informed validation. Whether the numerical result respects conservation laws, symmetries, or asymptotic limits is a correctness claim about the science, not a reproducibility claim about the bits. Results can be perfectly reproducible and scientifically wrong.
Both deserve attention; neither is a rung on this ladder.
Where vaibify sits
At time of writing, vaibify has L1 working, targets L2 as the near-term goal, and treats L3 as a stretch goal for a companion demonstration (the GJ 1132 XUV paper appendix). L4 and L5 are described here for completeness; the 2026 development effort does not attempt them.
Two deliberate design choices shape the implementation.
Local-first. The workspace is a directory on the scientist’s own machine, mounted into a container; artifacts stay on the scientist’s disk and under the scientist’s control. Contrast with hosted platforms (Whole Tale, Renku, Code Ocean) that require uploading work to run it. Local-first is more architecturally demanding — notably, workspace storage varies across host operating systems — but it respects how working scientists already organize their work and avoids creating yet another platform lock-in.
AI agents as equal participants in the authorship graph. Infrastructure that assumes “the author” is a human and “tools” are what agents use cannot cleanly accommodate a future where significant portions of a workflow are produced by agents. Vaibify’s pre-push manifest check, content-hash-bound test markers, and per-remote freshness badges exist so that agent output is subject to the same verification gates as human output, with no special privilege in either direction.
Scope and non-goals
Vaibify is:
A harness for containerizing a scientific workflow, recording its state in a content-addressed way, and verifying it against external authorities.
An implementation of the reproducibility ladder for a working scientist’s daily use.
A bridge between existing infrastructure (git, Zenodo, Overleaf, Docker) and the reproducibility claims a working scientist wants to make about their own work.
Vaibify is not:
A workflow management system. Nextflow, Snakemake, CWL, and Galaxy exist and are better at workflow orchestration. Vaibify uses a minimal JSON pipeline description because the target user is writing Python or shell scripts, not DSLs.
An AI agent. Vaibify does not generate code, reason about experiments, or draft papers. Systems like Denario, CMBAgent, and Sakana’s AI Scientist do that. Vaibify is the verification harness their output should pass through.
A hosted reproducibility platform. Whole Tale, Renku, and Code Ocean exist and serve users who want cloud-first workflows. Vaibify is local-first by design.
A cryptographic attestation service. Sigstore, Rekor, and in-toto exist for that. Vaibify integrates with them (or will, at L5); it does not replace them.
The larger bet
The scientific community’s trust infrastructure was built on the implicit assumption of human authorship: tenure, journal prestige, institutional affiliation, citation graphs. That infrastructure scales roughly linearly with human reviewers. Agentic research threatens to produce work at rates where reputation-based filtering is structurally unable to keep up.
The bet underlying vaibify is that the locus of scientific credibility will shift, over the next decade, from human expertise to machine-verifiable reproducibility — not replacing peer review, but supplementing it with a verifiable substrate that a reviewer (human or agent) can check in seconds rather than days. Getting astrophysics to L2 in 2026 is a small step toward that substrate. Getting the field to L5 by the time a remote-sensing result needs to be trusted as evidence for life beyond Earth is the ambition this framework anchors.
Vaibify’s contribution is intentionally modest: a working harness that one scientist can use today, shaped by the graded framework above, built to compose with the larger trust infrastructure as it emerges.