From Vibes to Contracts: Retrofitting SPDD onto Three AI-Built MVPs
LLM coding has moved the bottleneck.
For most teams, the slowest part is no longer typing code. A developer or agent can scaffold routes, write tests, refactor components, and wire integrations faster than a reviewer can understand what changed.
That creates a problem: the code gets faster, but the system gets harder to govern.
Requirements live in chat. Design intent lives in a planning doc. Safeguards live in someone's head. QA findings live in a report. Deployment decisions live in a status file. Six weeks later, a new agent can usually reconstruct what happened, but reconstruction is not reviewability.
That's why the Thoughtworks / Martin Fowler article on Structured-Prompt-Driven Development caught my attention. The core idea is straightforward: treat prompts as first-class delivery artifacts. Version them. Review them. Reuse them. When reality diverges, update the prompt/spec and the code together.
The article proposes a structure called REASONS:
- Requirements
- Entities
- Approach
- Structure
- Operations
- Norms
- Safeguards
I liked the shape. I was also skeptical.
Inside OpenClaw, we already had an agentic delivery loop: issues, durable plans, agent handoffs, implementation PRs, QA gates, deployment evidence, status files, and memory logs. I wasn't interested in adding ceremony for its own sake.
So instead of adopting SPDD because it sounded right, we tested it against real work.
The question
Could we retroactively apply SPDD to MVPs we'd already shipped, compare it against what actually happened, and decide whether it would have improved delivery?
This is a better test than a toy example. The MVPs had real deployment history, real rework, real QA findings, and real operational constraints. If a structured prompt artifact couldn't explain or improve those histories, it probably wasn't worth adding to the workflow.
We picked three deployed MVPs:
- A payment-enabled launch MVP with waitlist, pricing, checkout placeholders, analytics placeholders, and production-preview deployment.
- A read-only operator dashboard for an autonomous decision system, with safety boundaries, deployment topology, artifact loading, and live-readiness blockers.
- A local-first research assistant with domain-specific workflows, BDD scenarios, privacy/advice-boundary guardrails, and cross-module handoffs.
These were intentionally different. One was payment/config heavy. One was operator-safety/deployment heavy. One was product/workflow/UX heavy.
We also looked at the infrastructure around them: deployment metadata, logs, session history, project status files, repo history, QA reports, and planning artifacts. Langfuse traces helped, but they were only one source. The real picture came from combining traces with durable project artifacts.
Method
For each MVP, we ran the same retrospective exercise:
- Reconstruct the actual baseline outcome.
- Infer the original workflow and prompt shape from available evidence.
- Create a retroactive REASONS-style canvas.
- Compare the actual implementation against what the SPDD artifact would have made explicit.
- Score the expected delta across clarity, reviewability, defect prevention, traceability, and overhead.
- Ask the only question that matters: would this have prevented real rework, or just created paperwork?
The point was not to pretend we could perfectly replay history. We couldn't. Counterfactuals are messy.
But we could inspect real defects and real friction, then ask whether a feature-level prompt/spec contract would likely have surfaced them earlier.
MVP 1: payment and configuration boundaries
The first MVP was a launch page with waitlist capture, pricing intent, placeholder payment configuration, analytics flags, admin export, a health route, and deployment-preview hardening.
The actual outcome was solid: the preview was live, waitlist flows worked, missing payment configuration failed gracefully, and external setup blockers were clearly separated from code readiness.
But the history showed where SPDD would have helped.
Payment-enabled MVPs have many "not configured yet" states. Stripe keys may be missing. Price IDs may be placeholders. Webhook secrets may not exist. A refund cron may be deployed before it's allowed to act. Analytics flags may be absent. None of those should become a user-facing 500.
A REASONS artifact would have forced those states into a safeguards checklist before implementation:
- What should checkout do when Stripe is absent?
- What should it do when a secret exists but Price IDs are placeholders?
- What should the webhook route return before the webhook secret is configured?
- What should cron do before payments are live?
- What is the waitlist fallback?
- What smoke test proves the deployment is safe?
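To make that concrete, here is roughly what the checklist collapses into once it becomes code. This is a minimal sketch in TypeScript, not our actual implementation; the env var names (STRIPE_SECRET_KEY, PRICE_ID_PRO, PAYMENTS_LIVE) and the route shape are illustrative.

```typescript
// Every "not configured yet" state gets an explicit branch, so a missing
// secret degrades into the waitlist instead of a user-facing 500.
type CheckoutState =
  | { ready: true }
  | { ready: false; reason: "stripe_missing" | "price_placeholder" | "payments_disabled" };

function checkoutState(env: NodeJS.ProcessEnv = process.env): CheckoutState {
  if (!env.STRIPE_SECRET_KEY) {
    return { ready: false, reason: "stripe_missing" };
  }
  if (!env.PRICE_ID_PRO || env.PRICE_ID_PRO.startsWith("price_PLACEHOLDER")) {
    return { ready: false, reason: "price_placeholder" };
  }
  if (env.PAYMENTS_LIVE !== "true") {
    return { ready: false, reason: "payments_disabled" };
  }
  return { ready: true };
}

// The checkout route fails closed: a clear 503 plus a waitlist fallback.
export async function POST(_req: Request): Promise<Response> {
  const state = checkoutState();
  if (!state.ready) {
    return Response.json({ fallback: "waitlist", reason: state.reason }, { status: 503 });
  }
  // Only now is it safe to create a real Stripe checkout session.
  return Response.json({ ok: true });
}
```

The value is not the code itself; it's that every branch in this failure matrix was a reviewable line in the safeguards checklist before anyone wrote it.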
The key finding: SPDD wouldn't have magically produced better code. It would have made the failure matrix reviewable before code was written.
It wouldn't have solved external setup. Someone still had to create Stripe products, analytics projects, secrets, and final domains. SPDD can't eliminate human-owned blockers. But it can make the code's behavior around those blockers explicit.
MVP 2: operator safety and deployment drift
The second MVP was a read-only admin dashboard for an autonomous decision system.
This was safety-sensitive by design. Live actions were disabled. Credentials and capital confirmation were blockers. The dashboard existed to inspect system health, decisions, artifacts, readiness, and logs without exposing live execution controls.
The final outcome was good: a healthy preview, useful admin endpoints, artifact visibility, decision detail, readiness state, and safety boundaries.
The friction was around architecture and deployment drift.
An earlier plan described a multi-service shape. The actual safe preview became a simpler single-service read-only dashboard. That was the right decision, but the intent shifted across planning files, implementation, deployment docs, and status logs.
SPDD wouldn't have prevented the shift. In fact, it should have supported it. But it would have forced a sync point:
"We planned three services. Reality says one read-only service is safer for preview. Update the artifact, then update the code and deployment docs."
The same applied to artifact loading and logs. A fresh deployment volume could be empty. Logs needed to resolve relative to the artifact root, not a doubled path. The final implementation added fallbacks and tests, but SPDD would likely have named "artifact root," "fallback root," and "operator readiness" as entities earlier.
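Here's a minimal sketch of what naming those entities buys, assuming Node and hypothetical paths (ARTIFACT_ROOT and /data/artifacts are illustrative, not our deployment layout):

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Candidate roots in priority order. A fresh deployment volume may be empty,
// so resolution falls through the list rather than inventing a path.
const CANDIDATE_ROOTS = [
  process.env.ARTIFACT_ROOT,        // explicit override
  "/data/artifacts",                // mounted deployment volume
  join(process.cwd(), "artifacts"), // local development fallback
].filter((p): p is string => Boolean(p));

export function resolveArtifactRoot(): string | null {
  for (const root of CANDIDATE_ROOTS) {
    if (existsSync(root)) return root;
  }
  return null; // an honest "no artifacts yet", not a fabricated ready state
}

// Logs resolve relative to the resolved root, never by re-joining a prefix
// onto an already-complete path (the "doubled path" failure mode).
export const logPath = (root: string, file: string) => join(root, "logs", file);

// Readiness is derived from evidence, so an empty volume can't read as ready.
export function operatorReadiness(root: string | null): { ready: boolean; reason: string } {
  if (root === null) return { ready: false, reason: "no artifact root" };
  const hasArtifacts = existsSync(join(root, "decisions")); // hypothetical layout
  return { ready: hasArtifacts, reason: hasArtifacts ? "artifacts present" : "volume empty" };
}
```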
For safety-sensitive systems, that matters. You don't want reviewers inferring safety posture from scattered commits. You want an explicit contract:
- no live controls exposed in the UI;
- live execution disabled by default;
- missing artifacts must not create misleading readiness;
- deployment health must prove the dashboard is useful, not merely running;
- credentials and capital gates remain external blockers.
Again, SPDD wouldn't solve platform weirdness. It wouldn't fix stale domain ownership or browser certificate cache behavior. But it would reduce ambiguity around what the system was supposed to guarantee.
MVP 3: product workflow and cross-module handoffs
The third MVP was already closest to SPDD.
It had research ingestion, structured domain artifacts, BDD feature IDs, phased implementation plans, QA gates, and local-only privacy constraints. It had workflows for intake, strategy, planning, professional handoff context, screening, review, and downstream analysis.
In other words, this was not "vibe coding." It was already disciplined.
That made it the most interesting case.
The retrospective showed that SPDD wouldn't replace the existing process. It would centralize intent that was otherwise spread across feature files, build plans, QA reports, audits, status files, and chat handoffs.
The recurring issues were not basic correctness. They were product-shape and handoff issues:
- A workflow was technically complete but didn't yet feel like the guided assistant the user expected.
- The app risked feeling like a toolbox instead of a journey.
- Imported assumptions needed source labels and editability.
- Broker and deal-analysis handoffs needed clearer context objects.
- "Static smoke passed" was not the same as "real browser demo-ready."
These are exactly the problems that Entities, Approach, Operations, and Safeguards can expose before implementation.
For example, a cross-module handoff shouldn't be described vaguely as "import context." It should name the object:
- What fields are imported?
- Which fields are editable?
- Which fields are missing or stale?
- How is source shown?
- What must the UI refuse to imply?
- What test proves the handoff works?
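Here's a minimal sketch of what naming the object might look like in TypeScript. The fields are hypothetical; the point is that provenance, editability, and absence live in the type, not in UI afterthoughts.

```typescript
type Source = "user" | "research_import" | "broker_import";

interface HandoffField<T> {
  value: T | null;     // null means "missing", never silently defaulted
  source: Source;      // how the value got here, shown in the UI
  editable: boolean;   // imported assumptions stay user-editable
  importedAt?: string; // ISO timestamp, so stale values can be flagged
}

interface BrokerHandoffContext {
  propertyAddress: HandoffField<string>;
  askingPrice: HandoffField<number>;
  assumedVacancyRate: HandoffField<number>; // an assumption, so labeled and editable
}

// One concrete test target: the UI must render gaps as gaps, not as zeros.
function missingFields(ctx: BrokerHandoffContext): string[] {
  return Object.entries(ctx)
    .filter(([, field]) => field.value === null)
    .map(([name]) => name);
}
```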
That's where SPDD earns its keep. Not by adding more words, but by turning fuzzy handoffs into reviewable contracts.
What the scoring showed
Across the three retrospectives, the pattern was consistent.
The existing workflow was already strong. It averaged roughly 3.5 out of 5 across clarity, reviewability, defect prevention, traceability, and overhead.
With a lightweight SPDD layer, the expected score was roughly 4.4 out of 5.
The biggest uplift was not implementation speed. It was:
- reviewability;
- traceability;
- earlier safeguard detection;
- architecture drift control;
- cross-module handoff quality.
That distinction matters. If you judge SPDD by "will the model write code faster?", you might miss the point. The value is that a reviewer, future maintainer, or fresh agent can answer:
"What did we intend, what did we build, what safeguards mattered, and where did reality diverge?"
What SPDD does not solve
The retrospective also made the limits clear.
SPDD doesn't solve external blockers. It won't create Stripe products, analytics projects, exchange credentials, production domains, valid database tokens, or missing source material.
It doesn't eliminate live ops. Deployment platforms still have stale routes, disk pressure, certificate quirks, and weird failure modes.
It doesn't replace taste. Product judgment, UX quality, and positioning still require human or specialist review.
And it's too heavy for tiny changes. A one-file bugfix doesn't need a canvas. A copy nit doesn't need a prompt contract. Emergency hotfixes need speed, followed by a note.
The conclusion was not "use SPDD everywhere."
The conclusion was: use a small version where the coordination cost is already real.
What we adopted: REASONS-LITE
We created an OpenClaw-native version called REASONS-LITE.
In OpenClaw terms, REASONS-LITE now sits inside our Open Agentic Development loop. It gives planning agents, implementation agents, and review agents a shared contract instead of making each one reconstruct intent from chat history.
It's intentionally small: one page plus a checklist, unless the feature is safety- or payment-heavy.
It has eight sections:
- Requirements / Definition of Done
- Entities / handoff objects
- Approach / key decisions
- Structure / files touched
- Operations / ordered tasks
- Norms
- Safeguards / acceptance checklist
- Prompt/code sync log
The sync log is important. The artifact is not supposed to become stale documentation. If implementation materially diverges, the artifact changes in the same PR or commit.
Examples of material divergence:
- a three-service architecture becomes one service;
- payment fallback behavior changes;
- entity lifecycle states change;
- deployment target changes;
- a safety boundary changes;
- tests prove the original approach was wrong.
The rule is simple:
If future reviewers need to know why the code changed, update the contract that explains the intent.
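Because the artifact has fixed sections, it can also be checked mechanically. Here's a sketch of the shape as a TypeScript type, with illustrative field names (this is not our internal schema):

```typescript
interface SyncLogEntry {
  date: string;   // when reality diverged
  change: string; // e.g. "three services collapsed into one read-only service"
  ref: string;    // the PR or commit that updated code and artifact together
}

interface ReasonsLite {
  requirements: string[];  // definition of done, as checkable statements
  entities: string[];      // named domain and handoff objects
  approach: string;        // key decisions, a short paragraph
  structure: string[];     // files expected to change
  operations: string[];    // ordered implementation tasks
  norms: string[];         // conventions the code must follow
  safeguards: string[];    // acceptance checklist, including failure states
  syncLog: SyncLogEntry[]; // empty is fine; stale is not
}
```

A linter over this type can flag an artifact whose sync log is empty while the diff touches files outside its declared structure, which is exactly the drift we saw in MVP 2.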
When we use it
We now trigger REASONS-LITE for non-trivial agentic development work when any of these are true:
| Use it when the work involves... | Why |
|---|---|
| Payments, auth, admin, deployment, persistence, privacy, safety, or regulated-adjacent advice | These features fail in ways that must be explicit before implementation. |
| More than about five files or thirty minutes of implementation | Reviewers need a compact map of intent, structure, and expected evidence. |
| New domain entities or cross-module handoffs | Fuzzy object boundaries create integration bugs and UX drift. |
| Multiple agents or roles | Intent otherwise fragments across handoffs and reports. |
| Unconfigured external-service behavior | Placeholder states should fail closed, not surprise users. |
| Demo or production-preview readiness | Smoke tests, manual evidence, and deployment constraints need one checklist. |
We explicitly skip it for tiny fixes, copy nits, disposable spikes, and emergency hotfixes.
This is the balance that matters. The artifact must be small enough to use and structured enough to matter.
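The trigger rules are simple enough to encode, which keeps the decision consistent across agents. A sketch, with thresholds mirroring the table above (the field names are illustrative):

```typescript
interface ChangeShape {
  sensitiveSurface: boolean; // payments, auth, admin, deployment, privacy, safety
  estimatedFiles: number;
  estimatedMinutes: number;
  newEntitiesOrHandoffs: boolean;
  multipleAgents: boolean;
  unconfiguredExternalService: boolean;
  demoOrPreviewReadiness: boolean;
  emergencyHotfix: boolean;
}

export function needsReasonsLite(c: ChangeShape): boolean {
  if (c.emergencyHotfix) return false; // speed first, write the note after
  return (
    c.sensitiveSurface ||
    c.estimatedFiles > 5 ||
    c.estimatedMinutes > 30 ||
    c.newEntitiesOrHandoffs ||
    c.multipleAgents ||
    c.unconfiguredExternalService ||
    c.demoOrPreviewReadiness
  );
}
```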
The deeper lesson
The phrase "prompt-driven development" can sound like the prompt is the important thing.
I don't think that's quite right.
The useful artifact is the smallest reviewable contract between intent, generated code, and production risk.
That contract gives the model something better to follow. It gives reviewers something better to check. It gives future agents something better to resume from. And it gives the team a disciplined place to record when reality changed.
SPDD didn't replace our agentic workflow.
It gave our agents something better to argue with.
And that may be the difference between AI-assisted coding as a personal speed boost and AI-assisted delivery as an organizational capability.