Training Agents with Patches
Improve your agent iteratively by evaluating results and patching its configuration — goal and tools.
The Training Loop
┌──────────────┐ ┌──────────┐ ┌──────────────┐ ┌──────────────┐
│ Run Agent │ ──▶ │ Evaluate │ ──▶ │ PATCH Agent │ ──▶ │ Run Again │
│ │ │ Output │ │ goal/tools │ │ │
│ │ │ │ │ │ │ │
└──────────────┘ └──────────┘ └──────────────┘ └──────────────┘
Unlike fine-tuning a model, training an agent is about refining its instructions and tool set based on observed behavior. Every improvement is a simple PATCH call.
What you can patch
| Field | Description | Use case |
|---|---|---|
| `goal` | Agent instructions / prompt | Refine behavior, add edge cases |
| `tools` | Attached MCP tools | Add/remove capabilities |
| `name` | Display name | Versioned naming |
| `status` | Agent lifecycle | Activate/archive |
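Beyond the goal-refinement patterns below, `name` and `status` patches are handy for versioning. As a sketch, the helper below bumps a ` vN` suffix before re-activating an agent; the `promote` function and the `"active"` status value are assumptions of this example, not confirmed SDK behavior:

```python
import re

def bump_version(name: str) -> str:
    """Return the next versioned name, e.g. 'Research Bot v1' -> 'Research Bot v2'.
    If no ' vN' suffix is present, append ' v2'."""
    m = re.search(r"^(.*) v(\d+)$", name)
    if m:
        return f"{m.group(1)} v{int(m.group(2)) + 1}"
    return f"{name} v2"

def promote(client, agent_id):
    # Hypothetical helper: bump the display name and mark the agent active.
    current = client.agents.get(agent_id)
    client.agents.update(agent_id, name=bump_version(current.name), status="active")
```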
Pattern 1: Refine instructions based on output
The simplest training loop — run, evaluate, patch the goal.
from flymyai import AgentClient
client = AgentClient(api_key="fly-***")
# Create tools first — each tool has an integer ID
web_search = client.tools.create(mcp_tool="tavily")
agent = client.agents.create(
name="Email Drafter",
goal="Write a professional email about {{ topic }}.",
tools=[web_search.id],
)
# Run and evaluate
run = client.runs.create(agent_id=agent.id)
result = client.runs.wait(run.id)
# Too formal? Patch the goal
client.agents.update(
    agent.id,
    goal="""Write a professional but friendly email about {{ topic }}.
Keep paragraphs short (2-3 sentences). Use a warm sign-off.
Avoid corporate jargon.""",
)
# Run again — agent now uses refined instructions
run2 = client.runs.create(agent_id=agent.id)
result2 = client.runs.wait(run2.id)
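The "too formal?" judgment above is manual. A quick heuristic can flag jargon before you decide to patch; the phrase list below is illustrative, not part of the SDK:

```python
# Illustrative phrase list — extend with whatever "too formal" means for your use case
JARGON = {"synergy", "leverage", "circle back", "touch base", "per my last email"}

def sounds_corporate(text: str) -> bool:
    """Return True if the draft contains any flagged jargon phrase."""
    lower = text.lower()
    return any(phrase in lower for phrase in JARGON)
```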
Pattern 2: Automated training loop
Evaluate output programmatically and patch until quality meets a threshold.
def evaluate_output(output: dict) -> dict:
    """Returns {"score": 0-1, "issues": [...]}"""
    issues = []
    if len(output.get("summary", "")) < 100:
        issues.append("Summary too short")
    if not output.get("sources"):
        issues.append("No sources cited")
    score = 1.0 - (len(issues) * 0.3)
    return {"score": max(0, score), "issues": issues}
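The rubric above hard-codes two checks. The same idea scales better as a table of (label, predicate) pairs; this is a sketch of that generalization, not an SDK feature:

```python
from typing import Callable

# Each check is (issue label, predicate that flags the issue)
Check = tuple[str, Callable[[dict], bool]]

CHECKS: list[Check] = [
    ("Summary too short", lambda o: len(o.get("summary", "")) < 100),
    ("No sources cited", lambda o: not o.get("sources")),
]

def evaluate(output: dict, checks: list[Check] = CHECKS, penalty: float = 0.3) -> dict:
    """Score an output by subtracting a fixed penalty per failed check."""
    issues = [label for label, flag in checks if flag(output)]
    return {"score": max(0.0, 1.0 - penalty * len(issues)), "issues": issues}
```

Adding a new quality bar is then a one-line append to `CHECKS` rather than another `if` branch.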
web_search = client.tools.create(mcp_tool="tavily") # tool.id is an integer
agent = client.agents.create(
name="Research Bot v1",
goal="Research {{ topic }} and return a summary with sources.",
tools=[web_search.id],
)
for iteration in range(5):
    run = client.runs.create(agent_id=agent.id)
    result = client.runs.wait(run.id, timeout=600)

    if result.status != "completed":
        print(f" Run failed: {result.error}")
        break

    eval_result = evaluate_output(result.output)
    print(f" Iteration {iteration + 1}: score={eval_result['score']:.1f}")

    if eval_result["score"] >= 0.8:
        print(" ✓ Quality threshold reached")
        break

    # Refine the goal based on issues
    current = client.agents.get(agent.id)
    corrections = "; ".join(eval_result["issues"])
    client.agents.update(
        agent.id,
        goal=current.goal + f"\n\nCorrection (iteration {iteration + 1}): {corrections}",
    )
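Appending a correction on every iteration grows the goal without bound. A small guard can keep the base goal plus only the last few corrections; the `MARKER` string below assumes corrections are appended with the `"\n\nCorrection"` prefix used in the loop above:

```python
# Must match the prefix used when corrections are appended to the goal
MARKER = "\n\nCorrection"

def append_correction(goal: str, correction: str, keep: int = 3) -> str:
    """Append a correction, retaining the base goal plus at most `keep` corrections."""
    base, *corrections = goal.split(MARKER)
    corrections.append(correction)
    return base + "".join(MARKER + c for c in corrections[-keep:])
```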
Pattern 3: Tool evolution
Start simple, add tools based on what the agent struggles with.
# Create tools — each returns an object with an integer .id
web_search = client.tools.create(mcp_tool="tavily")
financial_api = client.tools.create(mcp_tool="financial_datasets")
# Start with just web search
agent = client.agents.create(
name="Analyst",
goal="Analyze {{ company }} and produce a report.",
tools=[web_search.id],
)
run = client.runs.create(agent_id=agent.id)
result = client.runs.wait(run.id)
# Agent couldn't access financial data — add a specialized tool
if "financial data unavailable" in str(result.output):
    client.agents.update(
        agent.id,
        tools=[web_search.id, financial_api.id],
    )
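Matching a single string in the output is brittle. A slightly more robust sketch maps failure phrases to the tool that would fix them; both the phrases and the mapping here are illustrative:

```python
# Illustrative mapping from failure phrases to the MCP tool that addresses them
TOOL_HINTS = {
    "financial data unavailable": "financial_datasets",
    "could not fetch the page": "tavily",
}

def missing_tools(output_text: str, hints: dict[str, str] = TOOL_HINTS) -> set[str]:
    """Return the set of tool names whose failure phrases appear in the output."""
    lower = output_text.lower()
    return {tool for phrase, tool in hints.items() if phrase in lower}
```

The returned names can then be resolved to tool IDs and patched onto the agent in one `update` call.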
Pattern 4: Review run logs before patching
Use execution logs to understand what the agent did before making corrections.
run = client.runs.create(agent_id=agent.id)
# Stream live
for event in client.runs.stream_events(run.id, timeout=600):
    if event.type == "tool_called":
        print(f" Tool: {event.message}")
        print(f" Args: {event.data}")
    elif event.type == "tool_call_exception":
        print(f" ✗ Error: {event.message}")

# Or inspect after completion
result = client.runs.get(run.id)
for log in result.logs:
    print(f"[{log.type}] {log.message}")
    if log.data:
        print(f" data: {log.data}")
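A quick tally of event types makes it obvious when failures are tool errors rather than prompt problems. This helper is a sketch that assumes each log entry exposes a `.type` attribute as in the loop above (plain strings also work, for testing):

```python
from collections import Counter

def summarize_logs(logs) -> Counter:
    """Count run events by type, e.g. to spot repeated tool_call_exception entries."""
    return Counter(getattr(log, "type", log) for log in logs)
```

If `tool_call_exception` dominates the tally, fix or swap the tool before touching the goal.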
Pattern 5: A/B testing agents
Create variants, run them on the same input, compare.
variants = [
{"name": "v1-concise", "goal": "Give a brief 2-paragraph answer about {{ topic }}."},
{"name": "v2-detailed", "goal": "Give a thorough analysis of {{ topic }} with examples."},
{"name": "v3-structured", "goal": "Analyze {{ topic }}. Return JSON with: summary, pros, cons, recommendation."},
]
web_search = client.tools.create(mcp_tool="tavily")
results = {}
for v in variants:
    agent = client.agents.create(**v, tools=[web_search.id])
    run = client.runs.create(agent_id=agent.id)
    result = client.runs.wait(run.id)
    results[v["name"]] = result.output
    client.agents.delete(agent.id)
# Compare outputs
for name, output in results.items():
print(f"\n--- {name} ---")
print(str(output)[:200])
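Eyeballing truncated outputs works for three variants; with more, score each output and pick a winner programmatically. The length-based default below is a placeholder for a real rubric such as `evaluate_output` from Pattern 2:

```python
def pick_winner(results: dict, score=lambda out: len(str(out))) -> str:
    """Return the variant name whose output scores highest under `score`."""
    return max(results, key=lambda name: score(results[name]))
```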
Best practices
- Small patches — change one thing at a time so you can attribute improvement
- Log everything — use `stream_events()` to understand why the agent behaved a certain way
- Version your agents — use descriptive names (`v1-concise`, `v2-with-sources`) or clone agents for A/B tests
- Automate evaluation — write scoring functions for your use case and loop until threshold
- Review before patching — always check logs first; the issue might be a tool failure, not a prompt problem