TL;DR
Many AI coding agent projects eventually devolve into chaos. This template is my attempt to stop that from happening: AI Product Template

Goal:
We all want self-driving software, aka:
- Build products fast
- Experiment with features
- Use multiple agents to work for me in parallel – for hours, days, or weeks
- Run multiple agents in parallel on isolated feature branches with independent verification environments.
- Go beyond “vibe-coded” POCs
- Actually run this in production
Marketing promises this, but empirical reality says this is not possible yet (at least for me). Very quickly, your product can become a “hot mess.”
So there are several principles I follow to make this possible:
Principles:
- Prioritize simplicity (always)
- Test, test, and test again. Integration tests and agent evaluation are first-class citizens in any design.
- Focus on security.
- Implement guardrails.
- Perform rigorous verification for any change.
- Maintain a good overview of agents.
Long story short: the loop shifts from generation to verification – success means verifications I am confident in.

And from the company strategy view – verification at scale, across multiple products, teams, agents, and features, is the main goal! Note this down as one of the main responsibilities of technical leadership.

How do we start? My answer – a template!
Template:
My current best answer – a custom template – is very feature-slim, simple, testable, extensible, and follows the principles I outlined above as much as possible!
- Feature-slim: Agents can now write any features, so you don’t need much prebuilt stuff.
- Simple/Managed Complexity: Rely on platforms like Railway and Modal to handle the heavy lifting.
- Testable: Integration tests are first-class citizens. (critical to prevent a hot mess)
- Extensible: Designed to run parallel feature branches seamlessly.
- Frontend: TypeScript + Vite for the UI
- Backend: Python + FastAPI for the API
- CI/CD: GitHub Actions
- Auth: Supabase
- Database: Postgres
- Platform: Railway
- ML: Modal
- Orchestration: Dagster
- Error Monitoring: Sentry
- LLM Monitoring & Evaluation: Braintrust

A nice bonus to make it really “self-driving software” is that each tool has its own MCP server exposed to agents. (Deep dive for infrastructure engineers: Build a Self-Healing K8s Agent with LibreChat MCP)
- GitHub: GitHub MCP Server
- Supabase: Supabase MCP Server
- Postgres: Postgres MCP Server
- Railway: Railway MCP Server
- Dagster: Dagster MCP Server (deep dive about it: Dagster LLM Orchestration MCP Server)
- Sentry: Sentry MCP Server
- Braintrust: Braintrust MCP Server
By standardizing these tools via MCP, we don’t just give the human a toolkit; we give the Agent a standardized interface to control the entire infrastructure.
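To make this concrete, here is a minimal sketch of how an agent-side client can talk to one of these servers using the official MCP Python SDK. The launch command, tool name, and arguments below are placeholders for illustration, not the exact configuration shipped with the template.

    import asyncio

    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    # Placeholder launch command – each server's README documents the real one.
    server_params = StdioServerParameters(command="npx", args=["-y", "some-mcp-server"])

    async def main() -> None:
        async with stdio_client(server_params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()

                # Discover the tools this server exposes to the agent.
                tools = await session.list_tools()
                print([tool.name for tool in tools.tools])

                # Call one of them (tool name and arguments are hypothetical).
                result = await session.call_tool("list_issues", arguments={"repo": "owner/repo"})
                print(result.content)

    asyncio.run(main())

The same pattern works whether the agent runs inside an editor, a CI job, or a long-lived orchestration process.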
And most important – AI engineering first! What do I mean by this?
- Feature branches – each agent has its own branch. (Deep dive: Cursor Railway Vibe Coding PR Environments)
- Bulletproof testing and evaluations – CI/CD, customer criteria, and end-to-end tests that each agent can run on demand.
- Each agent has its own cloud environment and can be verified independently.
- Anyone can contribute to the project: via Slack, Web, API, Custom UI, Editors.
A simple check for this:
- Could 10 agents run in parallel and produce meaningful results?
- Do you have evidence to prove AI coding agent output is ready to merge?
For enterprise use cases, your stack may be more complicated and vary widely, but the core principle – AI engineering first – holds true in every case.
Code
Full code – give it a try! Spin up multiple products and features from it and experiment in parallel, all while keeping your agents on a leash. As a starting point, I have a very minimal design and two example features:
Agentic – the user uploads a PDF and the app generates the best visualization for it. The output is hard to predict or manage.


It has simple evaluations in the form of integration tests:
- Does it work? test_upload_pdf_returns_valid_response
- Does it produce a valid format? test_upload_pdf_returns_executable_js
- Does another LLM think it’s good? test_upload_pdf_llm_judge_evaluation
- Does it perform well based on labeled data from before? test_upload_pdf_compare_to_historic_data
Simplified code for the “Does another LLM think it’s good?” part.
from google import genai  # google-genai SDK used for the "Judge" model

# `client` and `sample_pdf_path` are pytest fixtures from the project's conftest;
# `settings` holds the app configuration (e.g. the Gemini API key).

def test_upload_pdf_llm_judge_evaluation(client, sample_pdf_path):
    """
    Integration Test:
    1. Uploads a PDF to the Agent.
    2. Captures the generated visualization code.
    3. Uses a 'Judge' LLM (Gemini) to grade the output.
    """
    # 1. Act: Upload PDF and get the agent's response
    with open(sample_pdf_path, "rb") as f:
        response = client.post(
            "/agent/visualize",
            files={"file": ("sample.pdf", f, "application/pdf")},
        )
    assert response.status_code == 200
    generated_code = response.json()["d3_code"]

    # 2. Arrange: Set up the Judge
    judge_client = genai.Client(api_key=settings.gemini_api_key)
    judge_prompt = f"""
    You are an expert code reviewer.
    Evaluate this D3.js code generated from a PDF.

    Code:
    ```javascript
    {generated_code}
    ```

    Check for:
    1. Valid JavaScript syntax.
    2. Meaningful visualization logic.

    Respond with ONLY "PASS" or "FAIL" followed by a reason.
    """

    # 3. Assert: The Judge decides if the test passes
    judge_response = judge_client.models.generate_content(
        model="gemini-3-pro-preview",
        contents=judge_prompt,
    )
    result = judge_response.text.strip().upper()
    assert result.startswith("PASS"), f"LLM Judge rejected the code: {judge_response.text}"
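For comparison, the historic-data check can be a plain regression test. This is a hypothetical sketch: the labeled-examples file and the required_snippets field are assumptions for illustration, not part of the template.

    import json
    from pathlib import Path

    def test_upload_pdf_compare_to_historic_data(client):
        """For every previously labeled PDF, the agent's output should still
        contain the elements a reviewer marked as required."""
        # Hypothetical fixture collected from earlier, human-reviewed runs.
        labeled = json.loads(Path("tests/fixtures/labeled_examples.json").read_text())

        for example in labeled:
            with open(example["pdf_path"], "rb") as f:
                response = client.post(
                    "/agent/visualize",
                    files={"file": ("sample.pdf", f, "application/pdf")},
                )
            assert response.status_code == 200
            generated_code = response.json()["d3_code"]

            # Regression check: snippets that mattered before must still appear.
            for required in example["required_snippets"]:
                assert required in generated_code, (
                    f"Missing {required!r} for {example['pdf_path']}"
                )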
Deterministic – simple CRUD on “items” (no AI), just boring stuff (which is hugely valuable).


CRUD for items – you’ve seen it before, and testing it is straightforward with the Arrange-Act-Assert pattern (see the sketch below). It seems easy to add, but I have seen multiple times (and honestly I’m guilty of this myself) that if you skip it at the right moment, the cost later can be very high.
Never undervalue this stage: always ask the agent to add integration tests and make sure it follows TDD!
Both are important, and both are must-haves.
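For the deterministic side, a minimal Arrange-Act-Assert integration test might look like this; the /items payload fields are placeholders rather than the template's exact schema.

    def test_create_and_fetch_item(client):
        # Arrange: a new item payload (fields are illustrative).
        payload = {"name": "Test item", "description": "Created by an integration test"}

        # Act: create the item, then read it back through the public API.
        created = client.post("/items", json=payload)
        fetched = client.get(f"/items/{created.json()['id']}")

        # Assert: the round trip preserves what we sent.
        assert created.status_code == 201
        assert fetched.status_code == 200
        assert fetched.json()["name"] == payload["name"]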
Feature branches – when you add a new feature or build on top of the Agentic and Deterministic features, make sure each agent has full access to its own separate environment.
Based on this, you can add new features in parallel – on top of existing ones or combinations of them.
Outcome:
I am stress-testing this template and contributing back my findings, opinions, and learnings.
So far it’s my safety harness. It allows me to unleash 10 agents on a codebase, knowing that if they mess up, the guardrails will catch them before production.
My main recommendation for engineering leaders – no matter your stack – is to empower AI engineering by defining a set of principles and strong verification mechanisms at the company strategy level, and by making sure they are followed consistently.