Build a Self-Healing K8s Agent with LibreChat + MCP

A minimal-functionality test that patches a broken Deployment – hands-free

TL;DR: We’ll wire LibreChat to a tiny MCP server that can patch K8s resources. No human kubectl dance required.

Abstract 1928 etching by Paul Klee titled “Juggler in April.” The monochrome scene is filled with thin, scratch-like lines forming ladders, tree-like shapes, angular houses, and a small stick-figure juggler near the center, all set against a textured background of diagonal strokes that evoke falling rain.

Vibes of this post based on Tetragrammaton.

Why does this matter?

Shipping features is now lightning-fast – AI editing tools, cloud-based software engineering agents, and no-code tools get us to POC in hours. But owning that code still hurts. Most 3 AM pages aren’t logic bugs; they’re infrastructure blunders – typos in image names, mis-wired services, stray YAML. If we can hand those to an LLM agent that detects the breakage and patches it on the fly, we trade kubectl struggle for a single chat prompt.

Yes, off-the-shelf options like kubectl-ai and kagent exist, but in this post we’ll build the fix from scratch so you can understand every moving part and tailor it to your stack.

How are we going to build it?

At its core, an “agent” is just an LLM running in a while-loop:
while task_not_done: think → act → observe.
But, as with databases being “just files”, the details decide whether the loop is magic or misery (for a deep dive into what those details look like in practice, see Anthropic’s excellent article “Building Effective Agents”). Our plan:

Pick a build style

Path | Best for | Popular picks
No-code | Drag-and-drop POCs, non-dev teams | Langflow · Flowise · Dify · n8n
SDK / code-first | Full control, CI/CD, unit tests | OpenAI Agents SDK · Smolagents · Agent Development Kit · Agent Squad · LangGraph · LlamaIndex · CrewAI · Qwen-Agent · Semantic-kernel · AIQToolkit
Chat UI | End-user interface, demo speed | LibreChat · AnythingGPT forks

All three can speak MCP (Model Context Protocol) – a language-agnostic contract that lets an LLM do things, not just chat. If you haven’t met MCP yet, read this intro and come back; it’s the secret sauce.

Hand-drawn diagram with three rounded squares labeled “No code,” “SDK,” and “Chat UI,” arranged in a row, with a smaller rectangle below labeled “+ MCP.”
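To make the "LLM in a while-loop" framing concrete, here is a minimal sketch of think → act → observe. Everything in it is hypothetical: `Step`, `run_agent`, the scripted `think` function standing in for a real model, and the two fake tools are illustration only, not code from the repo.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    """One decision from the model: call a tool, or finish with an answer."""
    tool: Optional[str] = None
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None


def run_agent(think: Callable[[list], Step], tools: dict, task: str, max_steps: int = 5) -> str:
    """Loop until the model returns an answer or the step budget runs out."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        step = think(history)                              # think
        if step.answer is not None:
            return step.answer
        observation = tools[step.tool](**step.args)        # act
        history.append(f"{step.tool} -> {observation}")    # observe
    return "gave up: step budget exhausted"


# Fake cluster state and fake tools (hypothetical stand-ins for MCP tools).
state = {"image": "ngnix"}


def list_pods() -> str:
    return "Running" if state["image"] == "nginx" else "ImagePullBackOff"


def patch_image(image: str) -> str:
    state["image"] = image
    return "patched"


def scripted_think(history: list) -> Step:
    """Stand-in for the LLM: inspect first, then patch, then report."""
    if len(history) == 1:
        return Step(tool="list_pods")
    if len(history) == 2:
        return Step(tool="patch_image", args={"image": "nginx"})
    return Step(answer="patched image to nginx")


result = run_agent(scripted_think, {"list_pods": list_pods, "patch_image": patch_image}, "fix web-app-error")
```

The `max_steps` budget is one of those "details": without it, a confused model can loop forever.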

Use-case

For this demo I want to run a “minimum-functionality test” (MFT) for Kubernetes.
The idea comes from NLP: an MFT is something your ML system should always get right. Example – sentiment analysis must label “This is an amazing movie” as positive, no matter the params or model size.

So what’s the MFT for Kubernetes? A simple typo in the Docker image name – ngnix instead of nginx – super common (ask me how I know) and trivial to fix. It’s even part of Google’s benchmark for K8s agents, k8s-bench – see kubectl-ai, kagent and tasks list.
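As a rough illustration of what "fixing the typo" means in data terms, the sketch below builds the patch body an agent could send for this case. The helper `build_image_patch`, the container name `web`, and the `latest` tag are assumptions made up for the example; the real manifest lives in the repo. The patch shape relies on Kubernetes strategic merge patch, which merges the `containers` list by `name`.

```python
def build_image_patch(deployment: dict, typo_fixes: dict) -> dict:
    """Return a patch body that corrects known image-name typos, or {} if none apply.

    Simplified: splits on the first ":" only, so registries with a port
    (for example localhost:5000/nginx) are not handled.
    """
    fixed = []
    for container in deployment["spec"]["template"]["spec"]["containers"]:
        repo, sep, tag = container["image"].partition(":")
        if repo in typo_fixes:
            fixed.append({"name": container["name"], "image": typo_fixes[repo] + sep + tag})
    if not fixed:
        return {}
    return {"spec": {"template": {"spec": {"containers": fixed}}}}


# Hypothetical broken Deployment, trimmed to the fields that matter here.
broken = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web-app-error"},
    "spec": {"template": {"spec": {"containers": [{"name": "web", "image": "ngnix:latest"}]}}},
}

patch = build_image_patch(broken, {"ngnix": "nginx"})
```

In the demo the LLM works out this patch on its own; a hard-coded typo table like this is only useful as the expected answer when you score the test.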

The best part is that if you’re using a different infrastructure platform, the same concepts and patterns can be reused.

Stack & Setup

Hand-drawn diagram of a “kind cluster” box. Inside the top half are three smaller rounded boxes labeled “librechat,” “mongo,” and “k8s mcp.” A horizontal divider separates them from a single box in the lower half labeled “broken deployment.”

  • Frontend: LibreChat
    Free, open-source, ChatGPT-style UI that’s fully pluggable.
  • Brains: GPT-4.1 (or your local Llama3)
    Any model that supports tool calling will do.
  • Tooling: Custom K8s MCP server
    Exposes tools such as patch_resource, apply_yaml, and get_pod_logs.
  • Runtime: kind cluster
    Includes a deliberately broken NGINX Deployment the agent will fix.

Try it yourself: run the Makefile and the whole stack comes up in one command.

Cluster right after make up: LibreChat, Mongo, and k8s-manager are healthy; the web-app-error deployment is failing with ImagePullBackOff – exactly the bug our agent will fix.

Headlamp dashboard showing the default namespace. On the left, four Deployments are listed: “web-app-error” (5 pods), “mongo” (4 pods), “k8s-manager” (3 pods), and “librechat” (3 pods). On the right, the Pods table shows “mongo” and “librechat” pods in Running state, while all “web-app-error” pods are stuck in ImagePullBackOff.

A Headlamp snapshot.
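If you would rather check the failure from code than from a dashboard, here is a small sketch that scans a pod list for image-pull failures. It assumes input shaped like the output of `kubectl get pods -o json`; the function name and the sample pod names are made up for the example.

```python
PULL_ERRORS = {"ImagePullBackOff", "ErrImagePull"}


def pods_with_pull_errors(pod_list: dict) -> list:
    """Return names of pods that have a container waiting on an image-pull error."""
    failing = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for container_status in statuses:
            waiting = container_status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in PULL_ERRORS:
                failing.append(pod["metadata"]["name"])
                break  # one failing container is enough to flag the pod
    return failing


# Hypothetical sample, trimmed to the fields the function reads.
sample = {
    "items": [
        {
            "metadata": {"name": "web-app-error-abc"},
            "status": {"containerStatuses": [{"state": {"waiting": {"reason": "ImagePullBackOff"}}}]},
        },
        {
            "metadata": {"name": "mongo-xyz"},
            "status": {"containerStatuses": [{"state": {"running": {}}}]},
        },
    ]
}

failing = pods_with_pull_errors(sample)
```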

Agent & Solution

Inside LibreChat you add a new agent, pick an icon, and supply two things:

  • Prompt – the instructions you’ll give the LLM:
LibreChat agent-creation screen: a form titled “Kubernetes management agent” showing a Kubernetes logo, name and description fields, an instruction prompt box outlining the agent’s responsibilities, and the model selector set to gpt-4o.
  • Tools – the functions the LLM can call (exposed by our MCP server):
LibreChat “Tools + Actions” list displaying Kubernetes tools such as patch_resource, apply_yaml, get_pod_logs, list_pods, and describe_deployment with buttons to add or remove tools.

Where the tools come from

They’re registered in a tiny MCP server we ship as a Python class:

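# Assumed import path (official MCP Python SDK); check the repo for the exact one.
from mcp.server.fastmcp import FastMCP

# KubernetesManager is the repo's own helper class; see the full source for its definition.


class KubernetesManagerMCP: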
    """
    KubernetesManagerMCP provides methods to manage Kubernetes resources as MCP tools.
    """
    
    def __init__(self, name: str = "K8sManager", namespace: str = "default") -> None:
        """
        Initializes the KubernetesManagerMCP with the specified name and namespace.
        
        Args:
            name (str): Name for the MCP server.
            namespace (str): Kubernetes namespace to operate in.
        """
        self.k8s_manager = KubernetesManager(namespace=namespace)
        self.mcp = FastMCP(name)
        self._register_tools()
    
    def _register_tools(self) -> None:
        """Register all Kubernetes management tools with MCP."""
        
        @self.mcp.tool()
        def patch_resource(api_version: str, kind: str, name: str, patch_body: dict) -> str:
            """
            Patches a Kubernetes resource with provided patch body.

            Args:
                api_version (str): API version of the resource (e.g., "apps/v1").
                kind (str): Kind of the resource (e.g., "Deployment", "Service").
                name (str): Name of the resource to patch.
                patch_body (dict): JSON patch body defining the changes.
                
            Returns:
                str: Success message.
            """
            return self.k8s_manager.patch_resource(api_version, kind, name, patch_body)

        @self.mcp.tool()
        def apply_yaml(yaml_file_path: str) -> str:
            """
            Applies a YAML configuration to the Kubernetes cluster.

            Args:
                yaml_file_path (str): Path to the YAML file to apply.
                
            Returns:
                str: Success message.
            """
            return self.k8s_manager.apply_yaml(yaml_file_path)


            Returns:
                str: Pod logs.
            """

.....

Full source: mcp_k8s.py in the repo.
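The snippet above hands `patch_body` to `KubernetesManager.patch_resource`, whose implementation isn't shown here. To give a feel for what a dict-shaped patch does to a resource, here is a sketch of JSON Merge Patch (RFC 7386), one of the patch types the Kubernetes API accepts. Treat it as an illustration only: whether the repo's server uses this patch type is an assumption, and the sample Deployment is invented.

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) and return the patched value.

    Rules: dicts merge recursively, None deletes a key, and any other
    value (including a list) replaces the target wholesale.
    """
    if not isinstance(patch, dict):
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Hypothetical Deployment fragment with the typo from this post.
deployment = {
    "spec": {
        "replicas": 5,
        "template": {"spec": {"containers": [{"name": "web", "image": "ngnix"}]}},
    }
}
patch_body = {"spec": {"template": {"spec": {"containers": [{"name": "web", "image": "nginx"}]}}}}

patched = merge_patch(deployment, patch_body)
```

Fields the patch doesn't mention (like `replicas`) survive untouched, which is why a patch is safer for an agent than re-applying a whole manifest.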

Connecting LibreChat to MCP

    version: 1.2.1
    mcpServers:
      k8s:
        url: http://k8s-manager.default.svc.cluster.local:8000/sse

(See the complete file at lines 42-46 of k8s-libre.yaml).
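The URL in that config follows the standard in-cluster DNS pattern for a Service: service name, namespace, then `svc` and the cluster domain. A tiny sketch of how it is put together (the helper name is made up; `cluster.local` is the usual default cluster domain, and port 8000 and the `/sse` path come from the config above):

```python
def in_cluster_sse_url(
    service: str,
    namespace: str = "default",
    port: int = 8000,
    cluster_domain: str = "cluster.local",
) -> str:
    """Build the in-cluster URL for an MCP server exposed over SSE."""
    return f"http://{service}.{namespace}.svc.{cluster_domain}:{port}/sse"


url = in_cluster_sse_url("k8s-manager")
```

If you deploy the MCP server into another namespace, only the `namespace` part of the hostname changes.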

Because the MCP server runs in the same cluster as LibreChat, we don’t need to expose it outside the cluster, which makes things a little more secure. Once the MCP server is connected to LibreChat, assign the following tools to any of your agents:

Modal window titled “Agent Tools” showing individual tool cards for patch_resource, apply_yaml, get_pod_logs, describe_pod, and others, each with a wrench icon and a Remove button.

Outcome

See it yourself: the agent passed the Minimum-Functionality Test and fixed the root cause in seconds – conversation screenshot below.

Chat transcript between the user and “Kubernetes management agent.” The agent lists broken pods in ImagePullBackOff, identifies an image-name typo, asks permission to patch, receives “yes,” and confirms it ran patch_resource to update the Deployment.

Summary

Healthy deployment state – mission accomplished!

You just saw the agent patch a broken Deployment and bring every pod to Running.

Headlamp dashboard after the fix. In the default namespace, four Deployments—“web-app-error” (6 pods), “mongo” (4), “k8s-manager” (3), and “librechat” (3)—all show green circular arrows. The Pods table on the right lists each pod with Status = Running, confirming the image-name typo has been patched.

If you want to try it yourself, the full repo lives here → https://github.com/kyryl-opens-ml/mcp-repl/tree/main/examples/infra_librechat.

The cool part: you can mix-and-match LibreChat + MCP however you like – entirely inside a single local cluster. Have fun hacking!
