AI/LLM Engineering
From AI-assisted coding experiments to building a software engineering harness and shipping production LLM integrations.
From Experiments to Production
This journey started with coding alongside LLMs and quickly evolved into something bigger: figuring out how to make AI-assisted development actually reliable. Along the way I built a software engineering harness, shipped LLM-powered analytics at OpenForce, and learned that the hard part isn't generation — it's verification.
The Generaite Labs Era
Started by experimenting with different languages, frameworks, and architectures for AI-assisted development. Built apps in Rust, C#, Python, and TypeScript. Created synthesized documentation frameworks and architectural templates. Discovered that C# works for code generation but gets so heavy and verbose at scale that the overhead defeats the benefits. Built AstralMCP (a .NET MCP-to-REST adapter) and PowerPort (multi-tenant repo management) as proving grounds.
The key insight from this phase: the more ambitious your goals, the higher the level of abstraction you need to operate at. Stop thinking about implementation and start thinking about the process of implementation. That insight became the foundation for everything after.
Harness Engineering: CodeMySpec
Nobody was building what I wanted — a tool that injects the full software engineering process into AI-assisted development. So I built it. CodeMySpec is a Claude Code extension that orchestrates requirements, architecture, BDD specs, code generation, and QA for Elixir/Phoenix/LiveView applications.
Why One Stack
Constrained to Elixir/Phoenix/LiveView because there's basically one way to do things in the Elixir ecosystem. Environmental complexity stays inside the application — no Kubernetes, Docker, microservices. Makes it tractable for the model. Boundary library enforces architecture at compile time.
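As a rough sketch of what that compile-time enforcement looks like (the module names are illustrative, not CodeMySpec's actual layout): Boundary declarations sit at the top of each boundary module, and the :boundary compiler flags any cross-boundary call that breaks the declared rules.

```elixir
# mix.exs (assumption: the standard setup from the Boundary docs)
#   def project, do: [compilers: [:boundary] ++ Mix.compilers(), ...]
#   defp deps, do: [{:boundary, "~> 0.10", runtime: false}, ...]

defmodule MyApp do
  # Core boundary: business logic only. Other boundaries may call
  # just the contexts listed in :exports.
  use Boundary, deps: [], exports: [Fraud, Drivers]
end

defmodule MyAppWeb do
  # The web layer may depend on MyApp's exported contexts, never their
  # internals; a violation surfaces as a compiler warning (an error
  # with --warnings-as-errors) instead of shipping.
  use Boundary, deps: [MyApp], exports: [Endpoint]
end
```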
The Extension
Tried VS Code extension, TUI, CLI — none worked. Claude Code extension did. Ships as a Burrito-packaged binary with MCP server, hooks, agents, skills, and knowledge base. Walks you through a fixed workflow from stories to deployed code.
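The hooks are where stop-and-verify behavior gets enforced. CodeMySpec's actual hook set isn't reproduced here, but as a generic sketch of the Claude Code hook mechanism it builds on, a .claude/settings.json entry like the following recompiles after every file edit with warnings treated as errors, so broken output surfaces immediately instead of compounding:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "mix compile --warnings-as-errors"
          }
        ]
      }
    ]
  }
}
```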
What I Learned Building With It
Fuellytics
Fleet fuel card fraud detection. Stripe Issuing/Treasury/Connect, Twilio SMS, Claude Vision OCR, 5-validator fraud pipeline. 55 commits, 5 active days, 22 bounded contexts, zero human-written code. In UAT pending Stripe production approval.
MetricFlow
Cross-platform ad analytics correlating Google Ads, Facebook Ads, GA4 with QuickBooks revenue. Pearson + time-lagged cross-correlation engine. Claude-powered insights. 40 commits, 13 days, 12 contexts, 6 data source integrations.
The Velocity Problem
The hard part of AI-generated code isn't generating it. Generation is trivial now. The hard part is managing the velocity. CodeMySpec can produce a full context — schema, repository, LiveView, tests, BDD specs — in minutes. Let it run unchecked and you get 100,000 lines of code that compiles, passes its own tests, and doesn't actually work.
The agents built Potemkin villages. They'd catch a FunctionClauseError, wrap it in a try/rescue, show a "success" flash message, and move on. The QA agent would see the flash and mark the scenario as passing. Both agents collaborating to produce passing tests over broken functionality. The fix: QA must test outcomes, not UI elements.
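In test terms, the difference looks something like this hypothetical LiveView test (the fixture and context functions are stand-ins, not CodeMySpec output): the Potemkin version asserts the flash, the fixed version re-reads the data and asserts the real state.

```elixir
defmodule MyAppWeb.FraudFlagLiveTest do
  use MyAppWeb.ConnCase, async: true
  import Phoenix.LiveViewTest

  test "a flagged driver cannot clear the flag without a photo", %{conn: conn} do
    # Hypothetical fixture: a driver with an active fraud flag and no photo.
    driver = insert_flagged_driver()

    {:ok, view, _html} = live(conn, ~p"/drivers/#{driver.id}")

    view
    |> element("button", "Clear flag")
    |> render_click()

    # Potemkin assertion: passes as long as the UI says so.
    # assert render(view) =~ "Flag cleared"

    # Outcome assertion: re-read the record and check what actually happened.
    assert MyApp.Fraud.get_flag!(driver.id).status == :active
  end
end
```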
Testing Layers for AI Code
Unit Tests
Catch implementation errors. Don't know what the user wanted.
BDD Specs
Catch requirement misunderstandings. Don't test the running app.
Story QA
Catches bugs in the real environment. Doesn't test cross-feature paths.
Journey QA
End-to-end flows across features. Catches seam bugs between contexts.
Skip any one layer and that entire category of bug ships to production. Fuellytics had a fraud vulnerability where flagged drivers could clear their flag without submitting photos — the BDD spec explicitly passed, but the QA agent caught it.
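For a concrete sense of why the layers overlap without being redundant, here is a hypothetical BDD scenario of the kind involved (a reconstruction for illustration, not the actual Fuellytics spec):

```gherkin
Feature: Fraud flag clearance
  Scenario: Flagged driver must submit a receipt photo before the flag clears
    Given a driver flagged by the fraud pipeline
    When the driver tries to clear the flag without uploading a receipt photo
    Then the flag stays active
    And the driver is prompted to upload a photo
```

A spec like this can pass against its own step definitions while the running LiveView still exposes a clear-flag action with no photo check; only story QA against the live app surfaces that gap.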
LLM Integration at OpenForce
Separate from CodeMySpec, I built production LLM integrations at OpenForce as a data engineer:
Sparqy
AI analytics platform. A natural-language question goes to Snowflake Cortex (with TPC-H semantic schemas), which returns SQL plus a Vega-Lite spec. Claude orchestrates the dashboard layout via MCP tools. Blazor components render dynamically from the LLM-generated JSON.
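As an illustration of the chart half of that output (an invented example, not an actual Sparqy response), the spec side is plain Vega-Lite JSON, presumably with the SQL result rows bound as its data source when the component renders:

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "description": "Illustrative spec only: monthly revenue by region",
  "mark": "bar",
  "encoding": {
    "x": {"field": "month", "type": "temporal"},
    "y": {"field": "revenue", "type": "quantitative"},
    "color": {"field": "region", "type": "nominal"}
  }
}
```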
LLM Query Validation
Validated 170+ report migrations from Tennacle to Snowflake. Ran the production queries on the Bastion and saved the results to Postgres, ran the migrated versions against Snowflake, and used an LLM to iterate on the queries until the results matched.
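The interesting part is the iterate-until-match loop. A condensed sketch of its shape, in Elixir for consistency with the other examples here; the run_query and llm_fix callbacks are hypothetical stand-ins:

```elixir
defmodule ReportMigration.Validator do
  @max_attempts 5

  # Compare a migrated query's results against the captured baseline and,
  # on a mismatch, ask the LLM to revise the query and try again.
  # `run_query` and `llm_fix` are hypothetical callbacks supplied by the caller.
  def validate(initial_sql, baseline_rows, run_query, llm_fix) do
    Enum.reduce_while(1..@max_attempts, initial_sql, fn _attempt, sql ->
      rows = run_query.(sql)

      if rows == baseline_rows do
        {:halt, {:ok, sql}}
      else
        # Feed the LLM the current query plus the concrete mismatch,
        # then loop with whatever revised query it proposes.
        {:cont, llm_fix.(sql, %{expected: baseline_rows, got: rows})}
      end
    end)
    |> case do
      {:ok, sql} -> {:ok, sql}
      last_sql -> {:error, :attempts_exhausted, last_sql}
    end
  end
end
```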
Where It's Going
CodeMySpec has 3 customers and ~$3,500 in committed revenue. Two production apps built with it. The core insight holds: constrain the stack, inject the software engineering process, verify relentlessly. The 90% that works is remarkable. The 10% that doesn't will always need a human clicking through it.
Harness engineering — the discipline of making AI development reliable through structured workflows, progressive disclosure, validation hooks, and stop-and-verify loops — is what I'm focused on now. The models will keep getting better. The harness is what makes that power usable.
Key Learnings
Generation is trivial — managing the velocity is the hard part
No single testing layer catches everything in AI-generated code
Agents build Potemkin villages — QA must test outcomes, not UI elements
Constrain the problem space aggressively before trying to automate it
Harness engineering is the discipline of making AI development reliable