Case Study: How Vibe Coding Powered a High-Velocity ML Team

TL;DR: We built 1 production product, 1 semi-production tool, and 3 POCs in 2.5 months using vibe coding. Here’s how we did it and what we learned about the gap between AI-generated prototypes and real production systems.

Lessons Learned

  1. POCs shine, production still fights back.
  2. Keep agents on a tight leash – specific prompts + diff review.
  3. Bigger scope ⇒ harder everything, trim aggressively.
  4. Fewer lines cost less to own than more.
  5. One-page spec before the first prompt saves hours (grab the template).
A man in a suit leaps off a rocky cliff with arms spread wide, suspended mid-air against a cloudy sky - symbolizing bold leaps into the unknown, risk-taking, and the chaotic energy of vibe coding.

Vibe Coding Is Everywhere

The term “vibe coding” – popularized by Andrej Karpathy – perfectly captures what’s happening in software development right now. There’s even a book on the subject from Rick Rubin, the famous music producer: “The Way of Code: The Timeless Art of Vibe Coding.”

A minimalist book-style cover titled “The Way of Code: The Timeless Art of Vibe Coding,” featuring overlapping black circles in a meditative style. The design echoes Taoist aesthetics, referencing Lao Tzu and Rick Rubin, and evokes a contemplative fusion of coding and creative flow.

Definition: Vibe coding = an LLM-first, rapid-iteration workflow where you prompt, get code, run it, and iterate based on the “feel” rather than a full up-front spec.

LLMs are exceptionally good at writing code. Just look at the benchmarks: on SWE-bench, the Aider leaderboards, and many others, top models now solve a large share of coding tasks. On the Aider leaderboard below, the best entries score above 80% correct.

A leaderboard table titled "Aider LLM Leaderboards" comparing various large language models on coding tasks. It shows metrics like percent correct, cost, command used, and correct edit format. Top performers include o3-pro (high) with 84.9% accuracy and $146.32 cost, and gemini-2.5-pro-preview-06-05 (32k) with 83.1% accuracy and lower cost. The table emphasizes performance vs. cost trade-offs in AI-assisted code editing.

And developers are using it. A lot. According to Anthropic’s Economic Index, computer and mathematical jobs are a massive outlier: they account for 37.2% of Claude conversations but only 3.4% of the U.S. workforce.

A horizontal bar chart titled “AI usage by job type,” comparing the percentage of U.S. workers (black dots) versus percentage of Claude AI conversations (orange dots) across occupations. Computer and mathematical jobs show a dramatic overrepresentation (37.2% of conversations vs. 3.4% workforce share). Other fields like education, media, and business also show elevated usage, while sectors like transportation, food service, and construction show lower AI engagement.

Market Response

The market has responded enthusiastically, as shown in the Infrared report. We now have:

A slide titled “AI is already changing how we build software,” categorizing AI tools into three groups: App Builders (e.g., Replit, Lovable, Bolt), Coding Assistants (e.g., GitHub Copilot, Cursor, Windsurf), and Autonomous Coders (e.g., Claude Code, Codex, Devin). It contrasts no-code platforms for everyday users with co-pilot tools for software engineers and emerging autonomous coding agents designed to handle complex development workflows. Revenue estimates and acquisitions are listed, highlighting rapid market growth.

Sometimes the market reacts badly – like the Internet of Bugs YouTube channel debunking Devin’s Upwork demo, or Answer.AI’s “Thoughts On A Month With Devin”.

A YouTube video titled “Debunking Devin: ‘First AI Software Engineer’ Upwork lie exposed!” by Internet of Bugs, showing a bald man gesturing in front of bookshelves. A large blue graphic on the thumbnail reads “This Is A Lie.” The video critiques the validity of claims made about Devin, an AI software engineer. The Internet of Bugs channel focuses on software careers, bugs, and LLMs, with 81.9K subscribers.

But usually, we get good products.

✏️ Editors

  • Cursor: AI-first code editor
  • Windsurf: The IDE for AI agents
  • VS Code Copilot: GitHub’s AI pair programmer
  • Zed: High-performance multiplayer editor
  • Kiro: Agentic IDE for production code

🤖 Agents

  • Cursor Agents: Autonomous coding agents
  • OpenAI Codex: Powers GitHub Copilot
  • Jules by Google: AI coding companion
  • Claude Code: Anthropic’s coding assistant
  • Gemini CLI: Google’s AI in terminal
  • Devin: AI software engineer

🎨 UI Builders

  • Lovable.dev: Build apps with AI
  • Bolt.new: Full-stack web dev in browser
  • V0.dev: UI generation by Vercel
  • Gemini Canvas: Google’s AI workspace
  • Spark: Dream it. See it. Ship it.


Editors → Agents ← UI Builders → Editors

Lines are blurring: editors now ship built-in agents, agents offer editor plug-ins, and UI builders bundle both – progress in any layer instantly upgrades the whole stack.

Want to see how easy it is? Here are three apps I built with Bolt in one shot:

Mind the Gap

A man performs a split between two moving Volvo Globetrotter trucks on a highway at sunset. The scene symbolizes balance and precision, often used metaphorically to illustrate navigating challenges or bridging difficult divides - such as the gap between AI prototypes and production systems.

Here’s the crucial distinction: “Good at coding” ≠ “Good Software Engineer”

Good software engineers solve problems with code. Sometimes the best code is the code you never write.

There’s an interesting test here: if AI is so good at coding, where are the open source contributions? Projects like NumPy, PyTorch, Hugging Face Transformers, and PostgreSQL have been around for years, some of them for decades. So why aren’t there major AI-authored contributions to these projects?

A comic-style illustration of a woman and a man wearing large, futuristic VR headsets connected by robotic cables, with bright yellow halftone background. The blog header reads “Pivot to AI” with the tagline “It can’t be that stupid, you must be prompting it wrong.” Caption below asks, “If AI is so good at coding … where are the open source contributions?” - highlighting skepticism about AI’s real-world software impact.

Reference: https://pivot-to-ai.com/2025/05/13/if-ai-is-so-good-at-coding-where-are-the-open-source-contributions/

The Real Cost of Software

As Jeff Atwood says in “The Best Code is No Code At All”, the real cost of software isn’t writing it – it’s owning it:

  • Infrastructure
  • Support & maintenance
  • Security updates
  • Monitoring & observability
  • Upgrades & migrations

Writing code: ~20% | Maintaining it: ~80%

I think of vibe coding as a new abstraction layer. We went from machine code → assembly → C → high-level languages → and now AI-assisted programming. Back in the 90s, if you wrote in Python, people would say you’re not a real software engineer!

This is a higher abstraction with the flexibility of programming languages, but non-deterministic. As Martin Fowler writes about LLMs and abstraction, we’re not just moving UP in abstraction, but SIDEWAYS into non-determinism.

A layered diagram showing the abstraction hierarchy in software development. The vertical axis moves from low-level technical detail (“0s and 1s” and bytecode) to high-level abstraction (natural language). Flow arrows indicate how compilers convert programming languages into bytecode, and code generators convert models/DSLs into programming languages. “Low code” and “Frameworks” sit between models and languages. A large orange block labeled “Natural language” connects directly to all abstraction levels, showing how modern tools allow human-like, semi-unstructured interaction across the stack.

AWS’s Kiro is attempting to bridge this gap with spec-driven development, similar to how project management works at Big Tech.

Case Study

8-Person Team, 10-Week Sprint

Context first: I joined a Toronto venture studio focused on HCI (Human-Computer Interaction) to stand up an ML arm from scratch.

Goal: build prototypes for portfolio companies and ship at least one production product – all inside one quarter.

The Team

  • 4 computer science interns
  • 1 full stack engineer
  • 2 designers
  • 1 ML advisor (me)

Every interview included rigorous checks for AI tool usage. We went all-in on the Cursor Team plan.

Our Workflow

Team → Cursor (All LLMs, UI, Editor, Agents) → GitHub → Railway

A linear workflow diagram showing four connected blocks: “TEAM” (pink); “CURSOR” (purple), labeled “All LLMs, UI, Editor, Agent”; “GITHUB” (green), labeled “for code”; and “RAILWAY” (yellow), labeled “for deployment.” Arrows connect each stage left to right, illustrating the software development pipeline from team input to deployment.

The Numbers

Looking at our Cursor analytics:

A line graph titled “Total Line Changes from Chat” showing two data series from June 18 to July 17. Green line represents “Total Suggested Lines,” peaking above 10,000 several times. Blue line represents “Total Accepted Lines,” peaking around 4,000. The chart reveals frequent code suggestions with a smaller proportion accepted, indicating iterative LLM-driven coding activity.
A pie chart labeled “Chat Model Usage” showing distribution: claude-4-sonnet-thinking (44.4%), claude-4-sonnet (30.9%), and default (24.7%).
A bar graph titled “Chat Request Types” displaying daily usage breakdowns of Agent (yellow), Edit (green), Ask (blue), and Cmd+K (purple) chat actions from June 18 to July 17.
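  • ~500 lines of code accepted per user per day.
  • ~1,500 lines generated per user per day.
  • Claude Sonnet 4 was the most popular model (about 75% of chat requests across its thinking and non-thinking variants).
  • Agent requests dominated overall usage.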
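As a rough sanity check, the two line counts above imply that about one in three generated lines was accepted. The snippet below simply divides the rounded figures quoted in the list; the variable names are illustrative, and the result is only as precise as those approximations.

```python
# Rounded daily figures quoted in the list above.
accepted_lines_per_day = 500
generated_lines_per_day = 1_500

# Share of generated lines that ended up being accepted.
acceptance_rate = accepted_lines_per_day / generated_lines_per_day

print(f"Acceptance rate: {acceptance_rate:.0%}")  # Acceptance rate: 33%
```

Put differently, roughly two thirds of what the models produced was discarded, which is why diff review stays central to the workflow.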

GitHub stats over 90 days:

  • 335 Pull Requests
  • 811 Commits
  • 26.1 PRs/week velocity
  • 10.2 hours average merge time
  • 90% CI/CD success rate
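The velocity figure follows directly from the PR count and the 90-day window. The short check below reproduces it and also derives commits per PR, which is not in the original stats and assumes every commit landed through a pull request.

```python
# Figures from the GitHub stats above.
pull_requests = 335
commits = 811
window_days = 90

weeks = window_days / 7
prs_per_week = pull_requests / weeks

# Assumption: all commits belong to some PR, so this is only an approximation.
commits_per_pr = commits / pull_requests

print(f"{prs_per_week:.1f} PRs/week")          # 26.1 PRs/week
print(f"{commits_per_pr:.1f} commits per PR")  # 2.4 commits per PR
```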

What We Built in 2.5 Months

1 production product – used by 100 companies so far.
🚀 1 semi-production product – Newsletter with human in the loop.
🔬 3 POCs – Customer interviews & testing.
🔍 1 internal LLM evaluation framework based on Langfuse.

Patterns for Non-Technical Folks

We encouraged everyone – even non-engineers – to use vibe coding. Three patterns emerged:

Pattern 1: Slack

Create a dedicated Slack channel (e.g., #ml-team-vibe-coding) and message:

@Cursor repo=<repo-name> "Write what you want to do"

Wait for the green checkmark ✅, then ask an engineer to review.

Pattern 2: Web

Go to cursor.com/agents, select your repo, and chat away.

Pattern 3: Mobile

Same as web, but from your phone. Code from the beach! 🏖️

What’s Missing

There’s no “Accelerate” for AI coding yet. The fundamental research on efficient engineering organizations – Continuous Delivery, Architecture, Product and Process, Lean Management and Monitoring, Culture – needs a refresh for the AI era.

Book cover of “Accelerate: The Science of Lean Software and DevOps – Building and Scaling High Performing Technology Organizations” by Nicole Forsgren, PhD, Jez Humble, and Gene Kim. The design features dynamic horizontal bars in blue, green, and white on a dark gray background, evoking speed and flow. Includes forewords by Martin Fowler and Courtney Kissler, with a case study by Steve Bell and Karen Whitley Bell.

The key insight comes from Karpathy’s talk on keeping agents on the leash:

Instead of giving vague prompts like “Write me a Python rate limiter that limits users to n requests per minute”, be specific: “Implement a token-based rate limiter in Python with the following requirements…” The more constraints you provide, the better the output.

A slide contrasting two AI prompts for writing a Python rate limiter. The first prompt is vague: “Write a Python rate limiter that limits users to 10 requests per minute.” The second is a detailed version including requirements like user identification, thread safety, cleanup, and output format, along with additional considerations (e.g., clock drift, memory leaks). The message illustrates how precise prompting improves AI-generated code quality.
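To make the contrast concrete, here is a minimal sketch of what the specific prompt is asking for. It reads “token-based” as a token bucket, which is an assumption, and it covers only the requirements named on the slide: per-user identification, thread safety, and cleanup of idle state. The class name, method names, and defaults are hypothetical and are not taken from the slide or from our codebase.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class _Bucket:
    tokens: float      # tokens currently available to this user
    last_seen: float   # clock reading at the last update


class TokenBucketLimiter:
    """Per-user token bucket: up to `capacity` requests per `period` seconds."""

    def __init__(self, capacity: int = 10, period: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_rate = capacity / period  # tokens added per second
        self._clock = clock                   # monotonic by default, unaffected by wall-clock changes
        self._buckets: Dict[str, _Bucket] = {}
        self._lock = threading.Lock()         # guards _buckets across threads

    def allow(self, user_id: str) -> bool:
        """Consume one token and return True if `user_id` may proceed."""
        now = self._clock()
        with self._lock:
            bucket = self._buckets.get(user_id)
            if bucket is None:
                # New users start with a full bucket.
                bucket = _Bucket(tokens=float(self.capacity), last_seen=now)
                self._buckets[user_id] = bucket
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - bucket.last_seen
            bucket.tokens = min(self.capacity, bucket.tokens + elapsed * self.refill_rate)
            bucket.last_seen = now
            if bucket.tokens >= 1.0:
                bucket.tokens -= 1.0
                return True
            return False

    def cleanup(self, max_idle: float = 300.0) -> int:
        """Drop buckets idle for more than `max_idle` seconds; return how many were removed."""
        now = self._clock()
        with self._lock:
            stale = [u for u, b in self._buckets.items() if now - b.last_seen > max_idle]
            for user_id in stale:
                del self._buckets[user_id]
            return len(stale)
```

Calling `TokenBucketLimiter().allow("alice")` in a tight loop should admit the first 10 requests and reject the rest until tokens refill. The injectable `clock` is there so the behavior can be verified quickly in a test without sleeping, which is the kind of easy verification the specific prompt makes possible.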

If your verification is small, easy, and fast, you’re in a good position. Otherwise, you’re not.

A circular diagram illustrating the workflow of partial autonomy in UI/UX systems. It features two looping arrows: one labeled "AI" (blue, pointing right for generation) and the other "HUMAN" (red, pointing left for verification). Red and blue node-link networks represent data flow. Two annotated tips emphasize: (1) “Make verification EASY, FAST to win,” and (2) “Keep AI on a tight leash to increase the probability of successful verification.”
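One way to keep verification small is to measure each agent change before a human looks at it. The sketch below counts added and removed lines with Python’s standard `difflib` and flags changes that exceed a review budget. The function names and the 200-line default are hypothetical choices for illustration, not a rule we enforced.

```python
import difflib
from itertools import islice
from typing import Sequence, Tuple


def changed_lines(before: Sequence[str], after: Sequence[str]) -> Tuple[int, int]:
    """Count (added, removed) lines between two versions of a file."""
    added = removed = 0
    diff = difflib.unified_diff(before, after, lineterm="")
    # The first two lines of a unified diff are the ---/+++ file headers.
    for line in islice(diff, 2, None):
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return added, removed


def is_reviewable(before: Sequence[str], after: Sequence[str],
                  max_changed: int = 200) -> bool:
    """Return True if the change is small enough to review in one sitting."""
    added, removed = changed_lines(before, after)
    return added + removed <= max_changed


before = ["def add(a, b):", "    return a - b"]
after = ["def add(a, b):", "    return a + b"]
print(changed_lines(before, after))  # (1, 1)
```

A change that fails the check is a signal to split the prompt into smaller tasks rather than to review harder.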

Summary

Vibe coding is here. It’s not replacing engineers – it’s giving us better abstractions. The gap between prototype and production is still wide, but it’s narrowing.

Key takeaways:

If you’re not using these tools yet, start with the basics: give every engineer Cursor Pro and API keys. Run an onboarding session. You’ll get 80% of the value for 20% of the effort.


What’s your experience with vibe coding? What worked? What didn’t? What surprised you?

