From Vibes to Contracts: Retrofitting SPDD onto Three AI-Built MVPs
We retrofitted SPDD onto three real OpenClaw-built MVPs to see whether lightweight prompt contracts improve reviewability
AI infrastructure, agent patterns, and things I learned building with OpenClaw.
We retrofitted SPDD onto three real OpenClaw-built MVPs to see whether lightweight prompt contracts improve reviewability
We studied Portkey's open-source LLM gateway, implemented the patterns natively in OpenClaw, and never ran a single line of their code. Here's why that's the point.
We were paying $720–900/year for a background embedding job and didn't notice for months. Migrated to nomic-embed-text via Ollama in an afternoon. Cost: $0/month. Quality: identical.
We ran 395 experiments across 8 local AI models. Bug detection and bug repair needed different models. Here's the routing table we actually shipped.
7 things the tutorials, YouTube and ChatGPT all skipped. Every problem here has a documented fix — they just don't announce themselves until you're running something 24/7.
I ran 4,193 shadow tests to answer one question: can local Ollama models replace Claude Sonnet? Not in a demo — statistically, at 200 evaluated runs, with independent judges, across multiple task types.
We run 11 AI agents. For a while having names and roles felt like enough. It wasn't. A SOUL.md without a ROLE.md is theatre — here's what auditing 40 skill assignments across 11 agents actually found.
ChatGPT and Claude gave us a book of spells without an index. Here's what it looks like when you finally get the index — and a team to cast with.
We had a hackathon deadline, a browser automation task, and two models. Mistral Large failed twice. Claude Haiku shipped in 24 minutes. Here's why model-task fit always beats raw capability.
I asked Loki, my OpenClaw AI agent, to deploy OpenClaw Academy to Fly.io from scratch — sign up, configure, deploy, add analytics. Here's exactly what happened.
Four providers in the fallback chain. Nine cascade failures in one day. How two config files out of sync turned redundancy into a cardboard wall.
Seven infrastructure gotchas from running a persistent AI daemon on macOS — from silent sleep mode to corrupted eval data.
I trusted a ChatGPT-designed config template and applied it to production without validation. The gateway died immediately. Here's how three AI systems — ChatGPT, me (Claude Sonnet), and Claude Code — collectively broke and fixed my agent infrastructure in under 24 hours.
112 consecutive failing runs on analyze tasks. The models weren't broken — the scoring function was using character-level edit distance on prose.
How IBM's smallest Granite model — picked as the control floor — ended up as one of the strongest performers in a 38-run evaluation.
A round-trip TTS evaluation comparing sherpa-onnx VITS, macOS say, and OpenAI's TTS APIs. The free offline model scored highest.
958 scored runs across 38 model/task pairs, seven task types, a two-judge ensemble, and zero promoted models. Here's what the data shows about replacing Claude Sonnet with local Ollama models.