TLDR: Build a self-improving production agent with Claude Code, Pydantic AI, and Logfire. This is our flow and plan for today.

What is a Self-Improving Agent?
There are many definitions of this, and people argue about which one is right. My approach is practical. I’ve used this pattern many times in real production, replacing rigid rule engines and manual review workflows, and it blows my mind every time!
You build the agent for a use case with Claude Code, you deploy it to make it available to your users, you connect a coding agent to an observability system to collect real user requests and agent traces, and based on that you generate the next version of your agent. Simple, reviewable, and with a human in control.
Let’s do an example step by step!
Build Your Agent
Let’s start simple: build a sample agent for demo purposes so we can see how it improves automatically later on. Of course, we are going to use Claude to build it.

There are many different frameworks to build agentic and multi-agent systems – I recently started using Pydantic AI and it’s great. Alternatives such as CrewAI, LangGraph, OpenAI Agents SDK, Strands Agents, Google ADK, Smolagents, Mirascope are viable options as well.
For a simple demonstrative agent, Claude can one-shot it, but I still prefer plan mode.
input="Build a simple PydanticAI agent. Make sure to add tools and dependencies. Tools should hold deterministic logic, and the LLM should handle non-deterministic logic.
Use the Anthropic model. Use a simple financial industry-specific example, such as investment, insurance, underwriting, etc. The agent should be multi-step. Keep the code simple (< 50 lines).
PydanticAI reference (read them before build):
https://ai.pydantic.dev/llms.txt
https://ai.pydantic.dev/llms-full.txt
Structure the agent as a CLI and use uv run to test it. Put the code into src/agent/agent.py"
claude -n "self-improving-agent" "$input" --permission-mode plan --allowedTools "WebFetch"
Agent Plan
Plan: PydanticAI Loan Underwriting Agent
Context: First step of the self-improving agent pipeline. Build a simple PydanticAI agent for the financial domain (loan underwriting) that separates deterministic logic (tools) from non-deterministic logic (LLM).
File to create: src/agent/agent.py (~45 lines)
Design:
Dependencies (dataclass): LoanApplication with annual_income, total_debt, credit_score, loan_amount
Structured output (BaseModel): UnderwritingDecision with approved, max_loan, monthly_payment, reasoning
Agent: anthropic:claude-sonnet-4-6, with instrument=True for Logfire tracing
Tools (deterministic):
1. calc_dti – computes debt-to-income ratio including proposed loan
2. credit_risk – lookup table mapping credit score → risk band, max DTI, interest rate
LLM (non-deterministic): Synthesizes tool outputs to approve/deny, set max loan, calculate payment, explain reasoning
Multi-step flow: LLM calls both tools → compares actual DTI vs max allowed DTI for the credit tier → produces structured decision
CLI: 4 positional args: <income> <debt> <credit_score> <loan_amount>
import sys
from dataclasses import dataclass

from pydantic import BaseModel
from pydantic_ai import Agent, RunContext
from dotenv import load_dotenv

load_dotenv()


@dataclass
class LoanApplication:
    annual_income: float
    total_debt: float
    credit_score: int
    loan_amount: float


class UnderwritingDecision(BaseModel):
    approved: bool
    max_loan: float
    monthly_payment: float
    reasoning: str


agent = Agent(
    'anthropic:claude-sonnet-4-6',
    deps_type=LoanApplication,
    output_type=UnderwritingDecision,
    instrument=True,
    instructions="You are a loan underwriter. Use the provided tools to assess the applicant, then make a decision.",
)


@agent.tool
def calc_dti(ctx: RunContext[LoanApplication]) -> str:
    """Calculate debt-to-income ratio including the proposed loan."""
    monthly_income = ctx.deps.annual_income / 12
    monthly_debt = ctx.deps.total_debt / 12
    dti = (monthly_debt / monthly_income) * 100
    return f"DTI: {dti:.1f}% (monthly debt: ${monthly_debt:,.0f}, monthly income: ${monthly_income:,.0f})"


@agent.tool
def credit_risk(ctx: RunContext[LoanApplication]) -> str:
    """Look up credit risk category and max DTI allowed."""
    score = ctx.deps.credit_score
    if score >= 740: return f"Score {score}: Excellent. Max DTI: 45%. Rate: 6.5%"
    if score >= 670: return f"Score {score}: Good. Max DTI: 40%. Rate: 7.2%"
    if score >= 580: return f"Score {score}: Fair. Max DTI: 36%. Rate: 9.1%"
    return f"Score {score}: Poor. Max DTI: 30%. Rate: 12.5%"


def main():
    if len(sys.argv) != 5:
        print("Usage: uv run python -m agent.agent <income> <debt> <credit_score> <loan_amount>")
        sys.exit(1)
    app = LoanApplication(float(sys.argv[1]), float(sys.argv[2]), int(sys.argv[3]), float(sys.argv[4]))
    result = agent.run_sync(f"Evaluate this loan application for ${app.loan_amount:,.0f}", deps=app)
    print(result.output.model_dump_json(indent=2))


if __name__ == "__main__":
    main()
Verification:
uv run python -m agent.agent 85000 12000 720 250000
Expected: JSON output with approved, max_loan, monthly_payment, reasoning fields.
Once you’re happy with the plan, let the LLM implement it. For such a simple agent it can one-shot it easily.
claude -r "self-improving-agent" "read plans/agent-plan.md and implement it"
NOTE: in real production, interaction, task breakdown, and iterations help a lot.
Deploy Your Agent
Now you need to expose the agent to your users.

This can be done in several ways: API, async processing, batch. Everything depends on your context (I have a full part of my course dedicated just to this: Serving Basics, Serving Advanced). In most cases I’m connecting this to an existing system, not replacing it. For demonstration purposes, today I’ll just use it as a CLI with several hard cases.
input="Generate hard cases to stress-test the agent to the edge, structure as a CLI tool with input samples as JSON."
claude -r "self-improving-agent" "$input" --permission-mode plan
Agent Test Plan
Plan: Hard Test Cases + Eval CLI
Context: Stress-test the loan underwriting agent with edge cases that expose LLM reasoning failures. The agent’s calc_dti tool computes DTI from existing debt only (not including the new loan), and credit_risk maps scores to tiers at boundaries 740/670/580.
10 hard cases designed to exploit boundaries, conflicting signals, and absurd loan-to-income ratios:
1. DTI exactly at 45% max for Excellent tier (100k income, 45k debt, score 745, 300k loan)
2. DTI 45.1%, just over Excellent cap – tiny overshoot LLM might round away
3. Score exactly 740 (Excellent boundary), DTI 42%
4. Score 739 (Good tier), same DTI 42% – one point flips the outcome
5. $500k income, score 520 (Poor). DTI 5% is fine, credit is terrible – conflicting signals
6. Score 790, but $28k income. DTI 7% passes, but $150k loan is 5.4x income
7. Loan is 44x income ($2M on $45k) with passing DTI (20%). Tools show green, common sense says no
8. Perfect profile, zero debt, tiny $5k loan – trivial approve
9. Score 579 (one below Fair → Poor), zero debt, $300k loan at punitive 12.5% rate
10. Triple borderline: score exactly 670, DTI 39.5%, loan 6.2x income
Eval CLI (src/agent/eval.py ~25 lines): Loads cases.json, runs each case with agent.run_sync(), prints inputs + decision summary.
uv run agent-eval 3 # run first 3 cases
uv run agent-eval # run all 10
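The eval loop described in the plan can be sketched roughly as follows. This is a minimal sketch, not the generated eval.py: the case keys (name, income, debt, credit_score, loan_amount), the run_cases helper, and the stub_decide stand-in for agent.run_sync() are all assumptions made for illustration, and the second sample case uses made-up numbers.

```python
import json
from typing import Callable, Optional

# Hypothetical case layout; the real cases.json may use different keys.
# The first case mirrors hard case 1 from the plan, the second is invented.
SAMPLE_CASES = json.loads("""
[
  {"name": "dti-at-cap", "income": 100000, "debt": 45000, "credit_score": 745, "loan_amount": 300000},
  {"name": "trivial-approve", "income": 120000, "debt": 0, "credit_score": 800, "loan_amount": 5000}
]
""")


def run_cases(
    cases: list[dict],
    decide: Callable[[dict], dict],
    limit: Optional[int] = None,
) -> list[dict]:
    """Run `decide` on each case (or only the first `limit`) and collect a summary row."""
    selected = cases if limit is None else cases[:limit]
    rows = []
    for case in selected:
        decision = decide(case)
        rows.append({"name": case["name"], "approved": decision["approved"]})
    return rows


def stub_decide(case: dict) -> dict:
    """Stand-in for the agent call: approves when existing-debt DTI is at most 45%."""
    dti = 100 * case["debt"] / case["income"]
    return {"approved": dti <= 45}


for row in run_cases(SAMPLE_CASES, stub_decide):
    print(row)
```

Passing the decision function in as a parameter keeps the loop testable without an API key; in the real CLI it would wrap the agent call instead of the stub.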
And once again, the plan looks good, so one-shot it:
claude -r "self-improving-agent" "read plans/agent-tests.md and implement it"
After the implementation is done, ask Claude to run the samples or run them yourself (unless that was already done as part of your plan):
uv run agent-eval
NOTE: in real production, samples from an LLM are better than nothing, but they should come from real users, not an LLM. Otherwise, you risk optimizing the wrong piece.
Observe Your Agent
Now the next step is to make sure your agent submits traces.

You have an agent and users are using it, so let’s make sure we store its traces.
What are traces? Agents are basically dynamic workflows – for each input, a unique set of LLM calls, actions, and decisions. Each run is a unique trace. Believe me – you want to record those. Think of them as request traces in distributed systems. With traditional systems you monitor uptime and error rates – with agents you monitor how they think, and that’s what lets you improve them continuously.
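To make the idea concrete, here is a toy sketch of a trace as a tree of timed spans. It uses only the standard library and is purely illustrative: the Span and Trace classes and the span names are assumptions for this example, not Logfire’s data model or API.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One recorded step of an agent run (illustrative, not Logfire's schema)."""
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    duration_s: float = 0.0


class Trace:
    """Collects nested spans for a single agent run."""

    def __init__(self, name: str):
        self.root = Span(name)
        self._stack = [self.root]

    @contextmanager
    def span(self, name: str, **attributes):
        node = Span(name, attributes)
        self._stack[-1].children.append(node)  # attach to the currently open span
        self._stack.append(node)
        start = time.perf_counter()
        try:
            yield node
        finally:
            node.duration_s = time.perf_counter() - start
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list[str]:
    """Flatten the span tree into indented lines."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# One hypothetical run: the model picks tools, both tools run, the model decides.
trace = Trace("agent run")
with trace.span("model request", purpose="choose tools"):
    pass
with trace.span("tool: calc_dti"):
    pass
with trace.span("tool: credit_risk"):
    pass
with trace.span("model request", purpose="final decision"):
    pass

print("\n".join(render(trace.root)))
```

A different input could produce a different sequence of spans, which is exactly why each run is worth recording.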
With Pydantic AI and Logfire, you can just add 2 lines to your agent and everything will be recorded:
claude -r "self-improving-agent" "make sure my traces from agent are recorded in logfire add: logfire.configure();logfire.instrument_pydantic_ai() to core code and re-run all samples, make sure traces are in logfire"
NOTE: to use Logfire – create an account and add LOGFIRE_TOKEN as an env variable.
Now every time someone is using your agent, you can see exactly what happens:

NOTE: Logfire could be replaced with any LLM observability tool here, for example Braintrust, Datadog LLM Observability (which I use most of the time), or Langfuse.
Connect Claude Code to Traces
Make traces accessible to Claude Code:

It’s great that we have traceability and understand what’s happening, but it’s important to make sure somebody else can access the traces too. Who, you ask? Our coding agent, of course!
claude mcp add logfire --transport http https://logfire-us.pydantic.dev/mcp
This way, Claude can query your production traces directly.
Self Improve!
And now we are ready to close the loop!

claude "Analyze my Loan Underwriting Agent traces from Logfire for the last day, use logfire mcp. Find mistakes, inefficiencies, and propose improvements to my agent - prompt, tools, flow, etc" --permission-mode plan
The result is a set of recommendations for improving your agent, based on actual data points.
Agent Improvement Plan
Loan Underwriting Agent – Improvement Plan (from Logfire Trace Analysis)
Context: Analyzed 10 Loan Underwriting Agent traces from Logfire (2026-03-29 18:47–18:48). The agent runs on claude-sonnet-4-6 via PydanticAI with 2 tools (calc_dti, credit_risk) and a one-sentence system prompt. The eval suite uses 10 edge-case loan applications designed to stress-test boundary conditions.
Core finding: A tool bug causes systematic DTI under-estimation, and the under-specified prompt lets the LLM fill gaps with hallucinated rules, inconsistent math, and contradictory outputs.
Bugs Found in Traces
Critical: calc_dti tool doesn’t include the proposed loan payment
Docstring says: “Calculate debt-to-income ratio including the proposed loan”.
Implementation does: only divides total_debt by income and completely ignores loan_amount.
Impact: every DTI reported is artificially low. Cases 1, 3, 6, 10 were APPROVED when true post-loan DTI is 60–90%.
Critical: Case 6 – approved=true contradicts its own reasoning
Agent calculated post-loan DTI = 64.8% > 45% max, recommended $139,700 max – but structured output: approved: true, max_loan: 150000
Critical: Case 10 – approved $750k loan with ~90% true post-loan DTI
DTI tool returned 39.5% (existing debt only), agent approved. Real post-loan DTI: (3950+5093)/10000 = 90.4%
Bug: Case 9 – max_loan=$0 for qualifying applicant
Zero debt, $6,250/mo income, 30% DTI cap = $1,875/mo capacity. Agent returned max_loan=0, hallucinated “minimum score 580-620” rule.
Inconsistency: monthly_payment varies wildly
Case 1: $3,750 (existing monthly debt, not loan payment) | Case 3: $1,580 (correct 30yr amortization) | Case 8: $97 (5yr term assumed) | Case 10: $3,950 (existing monthly debt again)
Inconsistency: max_loan methodology
Sometimes the requested amount, sometimes a rough guess, sometimes $0.
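The Case 10 arithmetic above can be checked with a short sketch. It assumes a 30-year fixed-rate loan and the standard amortization formula, and it treats total debt as annual payments, as the original calc_dti tool does by dividing by 12. The function names are hypothetical and are not taken from the generated code.

```python
def monthly_payment(principal: float, annual_rate_pct: float, years: int = 30) -> float:
    """Fixed-rate amortization: P * r / (1 - (1 + r)^-n) with monthly rate r."""
    r = annual_rate_pct / 100 / 12
    n = years * 12
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)


def post_loan_dti(annual_income: float, annual_debt: float,
                  loan_amount: float, annual_rate_pct: float) -> float:
    """DTI in percent, counting existing debt plus the proposed loan payment."""
    monthly_income = annual_income / 12
    existing = annual_debt / 12
    proposed = monthly_payment(loan_amount, annual_rate_pct)
    return (existing + proposed) / monthly_income * 100


# Case 10 inputs as reported in the traces: $10,000/mo income, $3,950/mo existing debt,
# $750k loan, score 670 (Good tier: 7.2% rate, 40% max DTI).
existing_only = 3950 / 10000 * 100
with_loan = post_loan_dti(120_000, 47_400, 750_000, 7.2)
print(f"existing-debt DTI: {existing_only:.1f}%")
print(f"post-loan DTI:     {with_loan:.1f}%")
```

The existing-debt figure sits just under the 40% cap, while the post-loan figure lands around 90%, which is the gap the tool bug was hiding.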
What Works Well
Both tools called in parallel (efficient, single LLM round-trip). Credit tier classification always correct. Denial decisions on obviously bad cases (2, 4, 5, 7) are correct. Token usage reasonable (~1,835 in / 233–536 out per run). No errors, no retries, no wasted tool calls.
Changes
1. Add helper functions (agent.py)
Two pure functions: _get_rate(score) – shared by all tools, eliminates duplicated tier logic. _monthly_payment(principal, annual_rate, years=30) – standard amortization formula.
2. Fix calc_dti tool (agent.py:34)
Rewrite to include proposed loan payment: Use _get_rate to get the interest rate, use _monthly_payment to compute proposed loan payment. Return: existing debt, proposed payment, total monthly debt, post-loan DTI, max DTI for tier.
3. Add calc_max_loan tool (agent.py, new)
Reverse amortization: given remaining DTI capacity, compute max affordable principal. Eliminates LLM guesswork for max_loan field.
4. Simplify credit_risk tool (agent.py:42)
Use _get_rate helper. Functionally identical output but no duplicated tier thresholds.
5. Expand system prompt (agent.py:30)
Replace one-sentence prompt with explicit rules: Call ALL tools before deciding. APPROVE only if post-loan DTI <= max DTI for tier. Set max_loan from calc_max_loan tool, not your own estimate. Set monthly_payment to the proposed loan payment from calc_dti, not existing debt. All loans are 30-year fixed rate. Do not invent rules not listed here. Reasoning must be consistent with structured output fields.
6. Add post_loan_dti field to UnderwritingDecision (agent.py:19)
Makes DTI auditable in structured output. Enables programmatic cross-check.
7. Add validate_decision() function (agent.py, new)
Post-hoc validation: checks monthly_payment matches amortization, checks approved vs DTI consistency. Called in main() and eval.py – safety net for any remaining LLM errors.
8. Add expected outcomes to cases.json
Add expected_approved field. With corrected DTI that includes loan payments, 9/10 cases should be DENIED (only Case 8 – trivial $5k loan – should approve).
9. Update eval.py for pass/fail reporting
Compare decision.approved against expected_approved, call validate_decision(), print pass/fail.
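Changes 3 and 7 can be sketched as follows. This is a hedged illustration rather than the code Claude produced: max_affordable_loan inverts the standard amortization formula for a 30-year fixed loan, validate_decision checks only two of the consistency rules, and all names, signatures, and the $1 tolerance are assumptions.

```python
def monthly_payment(principal: float, annual_rate_pct: float, years: int = 30) -> float:
    """Fixed-rate amortization payment."""
    r = annual_rate_pct / 100 / 12
    n = years * 12
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)


def max_affordable_loan(annual_income: float, annual_debt: float, max_dti_pct: float,
                        annual_rate_pct: float, years: int = 30) -> float:
    """Reverse amortization: largest principal whose payment fits the remaining DTI capacity."""
    capacity = annual_income / 12 * max_dti_pct / 100 - annual_debt / 12
    if capacity <= 0:
        return 0.0
    r = annual_rate_pct / 100 / 12
    n = years * 12
    if r == 0:
        return capacity * n
    return capacity * (1 - (1 + r) ** -n) / r


def validate_decision(approved: bool, post_loan_dti_pct: float, max_dti_pct: float,
                      reported_payment: float, expected_payment: float,
                      tolerance: float = 1.0) -> list[str]:
    """Return a list of inconsistencies between the structured output and the math."""
    problems = []
    if approved and post_loan_dti_pct > max_dti_pct:
        problems.append("approved although post-loan DTI exceeds the tier maximum")
    if abs(reported_payment - expected_payment) > tolerance:
        problems.append("monthly_payment does not match the amortization formula")
    return problems


# Case 9 inputs from the traces: zero debt, $6,250/mo income, Poor tier (30% max DTI, 12.5% rate).
case9_max = max_affordable_loan(75_000, 0, 30, 12.5)
print(f"Case 9 max loan: ${case9_max:,.0f}")

# Case 6 as reported: approved=true with a 64.8% post-loan DTI against a 45% cap.
print(validate_decision(True, 64.8, 45, 0.0, 0.0))
```

Under these assumptions Case 9 gets a non-zero max_loan instead of the $0 the agent returned, and the Case 6 contradiction is flagged by a plain function rather than left to the LLM.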
Files to Modify
src/agent/agent.py – Fix calc_dti, add calc_max_loan, add helpers, expand prompt, add post_loan_dti field, add validate_decision
src/agent/cases.json – Add expected_approved to each case
src/agent/eval.py – Add pass/fail checking and validation
Verification
1. Run uv run agent-eval – all 10 cases should produce correct approve/deny decisions
2. Check Logfire traces – new tool outputs should contain post-loan DTI, proposed payment, max loan
3. Verify Case 8 approved, Cases 1-7 and 9-10 denied
4. Verify no validate_decision warnings on any case
Mix these recommendations with domain expert review and notes, and you have a very solid next iteration.
Now you are back at step 1, “Build your agent,” but better, more confident, and with actual data points. You can automate this, you can review each step, you can build on top of this!
Conclusion
This is a pattern I use all the time at large scale. It’s extremely scalable and useful. The ROI for improving agents from production traces is insane – worth every token and every minute/hour of setup!
The next step would be to add scoring and evaluation of traces and feed those signals to the coding agent as well, but that’s a topic for future blog posts.
Code and plans are here: https://github.com/kyryl-opens-ml/self-improving-agent.