AI Engineering Product Template

TLDR;

Many AI coding agent projects eventually devolve into chaos. This template is my attempt to stop that from happening: AI Product Template

Diagram comparing vibe coding vs AI-first coding with verification: vibe coding often fails production quality standards, AI-first approach consistently passes.

Goal:

We all want self-driving software, aka:

  • Build products fast
  • Experiment with features
  • Run multiple agents in parallel for hours, days, or weeks, each on an isolated feature branch with its own verification environment
  • Go beyond “vibe-coded” POCs
  • Actually run this in production

Marketing promises this, but empirical reality says this is not possible yet (at least for me). Very quickly, your product can become a “hot mess.”

So there are several principles I follow to make this possible:

Principles:

  • Prioritize simplicity (always)
  • Test, test, and test again. Integration tests and agent evaluation are first-class citizens in any design.
  • Focus on security.
  • Implement guardrails.
  • Perform rigorous verification for any change.
  • Maintain a clear overview of all running agents.

Long story short: move from generations to a generation–verification loop – success means verifications I am confident in.

Circular loop diagram showing “Generations (AI coding agent)” feeding into “Verifications (guardrails),” with a gear and shield icon in the center.
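
In code terms, the loop I am aiming for looks roughly like this – a sketch with hypothetical placeholder functions (generate_change, run_verifications), not the template’s actual API:

def generate_change(task: str) -> str:
    """Ask a coding agent for a candidate change (stubbed here)."""
    return f"candidate diff for: {task}"


def run_verifications(change: str) -> bool:
    """Run the guardrails: integration tests, lints, evals (stubbed here)."""
    return bool(change)


def build(task: str, max_attempts: int = 3) -> str | None:
    """Keep generating until the verifications pass, or give up."""
    for _ in range(max_attempts):
        change = generate_change(task)
        if run_verifications(change):
            return change  # only verified changes are allowed to ship
    return None  # better to stop than to merge unverified work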

And at the company strategy level, verification at scale – across multiple products, teams, agents, and features – is the main goal! Note this down as one of the main responsibilities of technical leadership.

Diagram titled “Company-wide AI development cycles (parallel workflows)” showing Team A running multiple feature cycles in parallel and Team B running separate cycles for a research project and infrastructure work.

How do we start? My answer – a template!

Template:

My current best answer – a custom template – is very feature-slim, simple, testable, extensible, and follows the principles I outlined above as much as possible!

  • Feature-slim: Agents can now write any feature, so you don’t need much prebuilt stuff.
  • Simple/Managed Complexity: Rely on platforms like Railway and Modal to handle the heavy lifting.
  • Testable: Integration tests are first-class citizens. (critical to prevent a hot mess)
  • Extensible: Designed to run parallel feature branches seamlessly.
  • Frontend: TypeScript + Vite for the UI
  • Backend: Python + FastAPI for the API
  • LLM: Gemini 3 (text, vision, live API, RAG) + DSPy for proper prompt optimization (see the sketch below)
Hand-drawn stack diagram listing frontend (TypeScript + Vite), backend (Python + FastAPI), database (Postgres), and supporting tools like Dagster, Supabase, Sentry, Braintrust, Railway, Modal, and GitHub Actions.
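
To make the LLM bullet concrete, here is a minimal DSPy sketch – the signature and the litellm-style model string are illustrative assumptions, not the template’s exact code:

import dspy

# Configure DSPy to call Gemini; the provider prefix / model id is an assumption.
lm = dspy.LM("gemini/gemini-3-pro-preview", api_key="YOUR_GEMINI_API_KEY")
dspy.configure(lm=lm)


class SummarizeDocument(dspy.Signature):
    """Summarize extracted document text in two sentences."""

    document_text: str = dspy.InputField()
    summary: str = dspy.OutputField()


summarize = dspy.Predict(SummarizeDocument)
result = summarize(document_text="...text extracted from an uploaded PDF...")
print(result.summary)

The point of DSPy here is that the prompt can later be optimized against the same evaluation suite instead of being hand-tuned.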

A nice bonus to make it really “self-driving software” is that each tool has its own MCP server exposed to agents. (Deep dive for infrastructure engineers: Build a Self-Healing K8s Agent with LibreChat MCP)

By standardizing these tools via MCP, we don’t just give the human a toolkit; we give the Agent a standardized interface to control the entire infrastructure.
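
For illustration, a minimal MCP server using the official mcp Python SDK (FastMCP) – the tool name and behavior below are hypothetical; the real servers are tool-specific:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("template-infra")


@mcp.tool()
def run_integration_tests(suite: str = "all") -> str:
    """Let an agent trigger the verification suite for a given area of the code."""
    # In a real setup this would shell out to pytest or call the CI API; stubbed here.
    return f"triggered integration tests for suite: {suite}"


if __name__ == "__main__":
    mcp.run()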

And most importantly – AI engineering first! What do I mean by this?

  • Bulletproof testing and evaluations – CI/CD, customer criteria, and end-to-end tests each agent can run on demand (see the sketch after this list).
  • Each agent has its own cloud environment and can be verified independently.
  • Anyone can contribute to the project: via Slack, Web, API, Custom UI, Editors.
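
As an example of “tests each agent can run on demand”, a small entry point like this sketch (the test path and layout are assumptions) lets any agent – or any human – run the full suite and get a clear pass/fail:

import subprocess
import sys


def run_verification_suite(test_path: str = "tests") -> bool:
    """Run the integration test suite and report pass/fail."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", test_path, "-q"],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    return result.returncode == 0


if __name__ == "__main__":
    sys.exit(0 if run_verification_suite() else 1)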

Simple check for it:

  • Could 10 agents run in parallel and produce meaningful results?
  • Do you have evidence to prove AI coding agent output is ready to merge?

In an enterprise setting, your stack may be more complicated and vary widely, but the core principle – AI engineering first – still holds in every case.

Code

Full code – give it a try! Spin up multiple products and features from it and experiment in parallel, all while keeping your agents on a leash. As a starting point, I include a very minimal design and two example features:

Agentic – the user uploads a PDF and the app generates the best visualization for it. The output is hard to predict or manage.

It has simple evaluations in the form of integration tests:

Simplified code for the “Does another LLM think it’s good?” part (imports and fixtures are approximate; see the fixture sketch after the test):

from google import genai  # google-genai SDK used for the Judge LLM

from app.config import settings  # hypothetical import path; adjust to the project's settings module


def test_upload_pdf_llm_judge_evaluation(client, sample_pdf_path):
    """
    Integration Test: 
    1. Uploads a PDF to the Agent.
    2. Captures the generated visualization code.
    3. Uses a 'Judge' LLM (Gemini) to grade the output.
    """
    
    # 1. Act: Upload PDF and get the agent's response
    with open(sample_pdf_path, "rb") as f:
        response = client.post(
            "/agent/visualize",
            files={"file": ("sample.pdf", f, "application/pdf")},
        )
    
    assert response.status_code == 200
    generated_code = response.json()["d3_code"]

    # 2. Arrange: Set up the Judge
    judge_client = genai.Client(api_key=settings.gemini_api_key)
    
    judge_prompt = f"""
    You are an expert code reviewer.
    Evaluate this D3.js code generated from a PDF.
    
    Code:
    ```javascript
    {generated_code}
    ```
    
    Check for:
    1. Valid JavaScript syntax.
    2. Meaningful visualization logic.
    
    Respond with ONLY "PASS" or "FAIL" followed by a reason.
    """

    # 3. Assert: The Judge decides if the test passes
    judge_response = judge_client.models.generate_content(
        model="gemini-3-pro-preview",
        contents=judge_prompt
    )

    result = judge_response.text.strip().upper()
    assert result.startswith("PASS"), f"LLM Judge rejected the code: {judge_response.text}"
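
The test above relies on two pytest fixtures. A sketch of what they could look like (the import path for the FastAPI app is hypothetical and depends on how the project is laid out):

import pytest
from fastapi.testclient import TestClient

from app.main import app  # hypothetical import path for the FastAPI app


@pytest.fixture
def client():
    """FastAPI test client used by the integration tests."""
    return TestClient(app)


@pytest.fixture
def sample_pdf_path():
    """Path to a small sample PDF kept with the test fixtures (hypothetical location)."""
    return "tests/fixtures/sample.pdf"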

Deterministic – simple CRUD on “items” (no AI), just boring stuff (which is hugely valuable).

CRUD for items – you’ve seen it before, and testing it is very straightforward with the Arrange-Act-Assert pattern. It seems easy to add, but I have seen multiple times (and honestly I am guilty of this myself) that if you skip adding it at the right moment, the cost later can be very high.
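
A minimal sketch of such a test, assuming the template exposes /items CRUD routes (paths, payload fields, and status codes are assumptions):

def test_create_and_get_item(client):
    # Arrange: a payload for the (hypothetical) items endpoint
    payload = {"name": "Test item", "description": "Created by an integration test"}

    # Act: create the item, then read it back through the API
    create_response = client.post("/items", json=payload)
    assert create_response.status_code in (200, 201)
    item_id = create_response.json()["id"]

    get_response = client.get(f"/items/{item_id}")

    # Assert: the round trip preserves what we sent
    assert get_response.status_code == 200
    assert get_response.json()["name"] == payload["name"]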

Never undervalue this stage, and always ask the agent to add integration tests and make sure to follow TDD!

Both are important, and both are must-haves.

Feature branches – when you add a new feature, or build on top of the Agentic and Deterministic features, make sure each agent has access to its own separate environment.

Based on this, you can add new features in parallel, either on top of existing ones or as combinations of them.
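
One way to make this concrete – a sketch assuming pydantic-settings, with hypothetical field names; the real template may wire environments differently. Each branch deploy sets its own environment variables, so every agent gets an isolated database and configuration:

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    """Per-environment configuration, filled from env vars set by the branch deploy."""

    branch_name: str = "main"
    database_url: str = "postgresql://localhost:5432/app_main"
    gemini_api_key: str = ""


settings = Settings()
# e.g. a feature-branch deploy sets BRANCH_NAME=feature-pdf-charts and DATABASE_URL
# pointing at that branch's own Postgres instance, so verifications never collide.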

Outcome:

I am stress-testing this template and contributing back my findings, opinions, and learnings.

So far it’s my safety harness. It allows me to unleash 10 agents on a codebase knowing that if they mess up, the guardrails will catch them before production.

My main recommendation for engineering leaders – no matter your stack – is to empower AI engineering by defining a set of principles and strong verification mechanisms at the company strategy level, and then making sure they are consistently followed.
