Orqen Docs

Code Examples

Benchmark your agent

A standalone script that runs the same query twice, once direct to your provider and once via Orqen, and compares input token counts. It has one dependency (httpx) and no mocking; the numbers come straight from the provider response.

Expected output

A 31-tool agent, weather query, claude-haiku-4-5. Your numbers will vary by model and prompt — the ratio is what matters.

Running direct (api.anthropic.com)...  done  1,842ms
Running via Orqen (api.orqen.app)...   done    967ms

Tools available: 31
Model:           claude-haiku-4-5-20251001
Prompt:          What is the current weather in Paris, France? Should I bring an umbrella today?

──────────────────────────────────────────────────
                      DIRECT    VIA ORQEN    DELTA
──────────────────────────────────────────────────
Input tokens       8,142        1,207   −6,935 (−85%)
Output tokens         24           24         +0
Tools forwarded       31            1
Orqen overhead         —        <20ms
──────────────────────────────────────────────────
Answer match: ✓
Response kind:       direct=tool_call  orqen=tool_call

At claude-haiku-4-5 pricing ($0.80/M input):
  Direct:     $0.00651 per call
  Via Orqen:  $0.00097 per call
  Savings:    $0.00555 per call  (~$5.55 per 1,000 calls)

Answer match: ✓ means both calls selected the same tool; the script compares tool names, not arguments. Output tokens are unchanged because Orqen only trims input. If the script prints a recall miss warning, review the description of the tool named in the warning and re-run.
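The pricing lines follow from simple arithmetic on the input token counts. The sketch below reproduces them from the sample numbers above, assuming the $0.80/M input rate shown in the sample output; `cost_per_call` is an illustrative helper name, not a function in the script.

```python
def cost_per_call(input_tokens: int, price_per_million: float) -> float:
    """Input-token cost of a single call, in dollars."""
    return input_tokens * price_per_million / 1_000_000


# Sample token counts from the table above; the rate is an assumption
# taken from the sample output, so substitute your provider's current price.
direct = cost_per_call(8_142, 0.80)
orqen = cost_per_call(1_207, 0.80)
saving = direct - orqen

print(f"Direct:    ${direct:.5f} per call")
print(f"Via Orqen: ${orqen:.5f} per call")
print(f"Savings:   ${saving:.5f} per call (~${saving * 1000:.2f} per 1,000 calls)")
```

Because output tokens are the same on both paths, only the input side enters the calculation.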

Quick start

pip install httpx

export ORQEN_API_KEY=sk-orq-...        # orqen.app → Settings
export ANTHROPIC_API_KEY=sk-ant-...    # your Anthropic key (direct baseline)

python benchmark.py

# Try a different prompt or model:
python benchmark.py --prompt "Book a flight from London to Tokyo"
python benchmark.py --model claude-sonnet-4-6

What it measures

Input tokens — the only dimension Orqen optimises. Both calls use the same model; the delta comes entirely from Orqen routing to a smaller, relevant tool subset rather than forwarding all 31 schemas.
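As a rough illustration of the input-token comparison, the sketch below derives the delta and percentage from two response bodies. It assumes the Anthropic Messages response shape, where the count lives in `usage.input_tokens` (the field the script reads); the helper name `input_token_delta` and the sample dictionaries are made up for this example.

```python
from __future__ import annotations


def input_token_delta(direct: dict, orqen: dict) -> tuple[int, float | None]:
    """Return (token delta, percent change) of the Orqen call relative to direct."""
    d = int(direct.get("usage", {}).get("input_tokens") or 0)
    o = int(orqen.get("usage", {}).get("input_tokens") or 0)
    delta = o - d                           # negative means Orqen sent fewer tokens
    pct = delta / d * 100 if d else None    # undefined when the baseline is zero
    return delta, pct


# Trimmed-down response bodies carrying only the field used here.
direct_resp = {"usage": {"input_tokens": 8142, "output_tokens": 24}}
orqen_resp = {"usage": {"input_tokens": 1207, "output_tokens": 24}}

delta, pct = input_token_delta(direct_resp, orqen_resp)
print(f"{delta:+,} ({pct:+.0f}%)")
```

Using a signed delta keeps the report readable in the unlikely case where the routed call ends up with more input tokens than the direct one.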

Output tokens — should be equal. If the response kind flips from tool_call to text, usually alongside a noticeably larger Orqen output, the script flags a recall miss: the needed tool was pruned and the model answered in prose instead. That hurts correctness, not just cost, so fix the tool description and re-run.
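The recall-miss check comes down to comparing response kinds. Here is a minimal sketch, assuming both responses follow the Anthropic Messages format, in which a tool call appears as a content block whose `type` is `tool_use`; `response_kind` and `is_recall_miss` are illustrative names rather than the script's own.

```python
def response_kind(data: dict) -> str:
    """Return 'tool_call' if any content block is a tool_use block, else 'text'."""
    blocks = data.get("content", [])
    has_tool = any(isinstance(b, dict) and b.get("type") == "tool_use" for b in blocks)
    return "tool_call" if has_tool else "text"


def is_recall_miss(direct: dict, orqen: dict) -> bool:
    """True when the direct call used a tool but the routed call answered in prose."""
    return response_kind(direct) == "tool_call" and response_kind(orqen) == "text"


# Hypothetical response bodies, reduced to the content blocks.
tool_resp = {"content": [{"type": "tool_use", "name": "get_current_weather",
                          "input": {"city": "Paris"}}]}
prose_resp = {"content": [{"type": "text", "text": "I can't check live weather."}]}

print(is_recall_miss(tool_resp, prose_resp))  # True
print(is_recall_miss(tool_resp, tool_resp))   # False
```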

Answer match — both calls should select the same tool. A mismatch is surfaced explicitly so you don't mistake a correctness difference for a token savings win.
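The script's match check looks at tool names only. If you also care whether both calls passed the same arguments, a stricter comparison might look like the sketch below. This is an optional extension, not something the script does; `tool_calls` and `same_answer` are hypothetical helpers, and the response shape assumes Anthropic `tool_use` blocks with `name` and `input` fields.

```python
from __future__ import annotations


def tool_calls(data: dict) -> list[tuple[str, dict]]:
    """All (name, input) pairs from tool_use blocks, in response order."""
    return [(b.get("name"), b.get("input", {}))
            for b in data.get("content", [])
            if isinstance(b, dict) and b.get("type") == "tool_use"]


def same_answer(direct: dict, orqen: dict, compare_args: bool = False) -> bool:
    """Compare the first tool call of each response by name, optionally by arguments."""
    d, o = tool_calls(direct), tool_calls(orqen)
    if not d or not o:
        return not d and not o  # both answering in prose counts as a match
    (d_name, d_args), (o_name, o_args) = d[0], o[0]
    if d_name != o_name:
        return False
    return d_args == o_args if compare_args else True


# Hypothetical responses: same tool, slightly different arguments.
a = {"content": [{"type": "tool_use", "name": "get_current_weather",
                  "input": {"city": "Paris"}}]}
b = {"content": [{"type": "tool_use", "name": "get_current_weather",
                  "input": {"city": "Paris", "unit": "celsius"}}]}

print(same_answer(a, b))                     # True: names agree
print(same_answer(a, b, compare_args=True))  # False: arguments differ
```

Name-only matching is the more forgiving default, since models often vary optional arguments between otherwise equivalent calls.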

The script

benchmark.py
#!/usr/bin/env python3
"""Orqen benchmark — reproducible token savings measurement.

Runs the same query against a 31-tool agent twice:
  1. Direct: straight to your LLM provider with all 31 tool schemas
  2. Via Orqen: Orqen routes the request, forwards only the relevant tool

Both calls go to the same model. Input token counts come from the provider response.

Prerequisites
-------------
    pip install httpx
    export ORQEN_API_KEY=sk-orq-...       # orqen.app -> Settings
    export ANTHROPIC_API_KEY=sk-ant-...   # for the direct baseline

Usage
-----
    python benchmark.py
    python benchmark.py --model claude-sonnet-4-6
    python benchmark.py --prompt "Book a flight from London to Tokyo"
"""

from __future__ import annotations
import argparse
import os
import time

try:
    import httpx
except ImportError:
    raise SystemExit("pip install httpx")


# ── Tool list (31 tools — only 1 is relevant for a weather query) ─────────────

def _fn(name, desc, props, req=None):
    return {"name": name, "description": desc,
            "input_schema": {"type": "object", "properties": props, "required": req or []}}

TOOLS = [
    _fn("get_current_weather",    "Get current live weather for a city.",
        {"city": {"type": "string"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}}, ["city"]),
    _fn("get_weather_forecast",   "Get a weather forecast for the next few days.",
        {"city": {"type": "string"}, "days": {"type": "integer"}}, ["city"]),
    _fn("geocode_city",           "Find lat/lon, country, and timezone for a city.",
        {"city": {"type": "string"}}, ["city"]),
    _fn("calculate_compound_interest", "Calculate compound interest.",
        {"principal": {"type": "number"}, "annual_rate_percent": {"type": "number"}, "years": {"type": "number"}},
        ["principal", "annual_rate_percent", "years"]),
    _fn("calculate_loan_payment", "Calculate a fixed monthly loan payment.",
        {"principal": {"type": "number"}, "annual_rate_percent": {"type": "number"}, "years": {"type": "number"}},
        ["principal", "annual_rate_percent", "years"]),
    _fn("convert_temperature",    "Convert a temperature between Celsius and Fahrenheit.",
        {"value": {"type": "number"}, "from_unit": {"type": "string"}, "to_unit": {"type": "string"}},
        ["value", "from_unit", "to_unit"]),
    _fn("lookup_customer",        "Look up a customer profile by customer id.",
        {"customer_id": {"type": "string"}}, ["customer_id"]),
    _fn("get_order_status",       "Look up an e-commerce order status.",
        {"order_id": {"type": "string"}}, ["order_id"]),
    _fn("estimate_shipping_rate", "Estimate shipping cost from weight and destination.",
        {"weight_kg": {"type": "number"}, "destination_country": {"type": "string"}},
        ["weight_kg", "destination_country"]),
    _fn("search_knowledge_base",  "Search a support knowledge base.", {"query": {"type": "string"}}, ["query"]),
    _fn("summarize_incident",     "Summarize an incident from severity and notes.",
        {"severity": {"type": "string"}, "service": {"type": "string"}, "notes": {"type": "string"}},
        ["severity", "service", "notes"]),
    _fn("calculate_sla_deadline", "Calculate an SLA deadline.",
        {"created_at": {"type": "string"}, "sla_hours": {"type": "number"}}, ["created_at", "sla_hours"]),
    _fn("validate_email_address", "Validate the format of an email address.", {"email": {"type": "string"}}, ["email"]),
    _fn("normalize_phone_number", "Normalize a phone number to E.164.", {"phone": {"type": "string"}}, ["phone"]),
    _fn("lookup_airport",         "Look up an airport by IATA code.", {"iata_code": {"type": "string"}}, ["iata_code"]),
    _fn("calculate_distance_estimate", "Estimate great-circle distance between two cities.",
        {"from_city": {"type": "string"}, "to_city": {"type": "string"}}, ["from_city", "to_city"]),
    _fn("create_support_ticket",  "Create a support ticket.",
        {"subject": {"type": "string"}, "description": {"type": "string"}, "priority": {"type": "string"}},
        ["subject", "description"]),
    _fn("get_subscription_status", "Look up subscription status.", {"account_id": {"type": "string"}}, ["account_id"]),
    _fn("calculate_invoice_total", "Calculate invoice total.",
        {"amounts": {"type": "array", "items": {"type": "number"}}, "tax_rate_percent": {"type": "number"}}, ["amounts"]),
    _fn("check_inventory",        "Check inventory level for a SKU.", {"sku": {"type": "string"}}, ["sku"]),
    _fn("reserve_inventory",      "Reserve inventory.", {"sku": {"type": "string"}, "quantity": {"type": "integer"}},
        ["sku", "quantity"]),
    _fn("recommend_plan",         "Recommend a SaaS plan.",
        {"seats": {"type": "integer"}, "features": {"type": "array", "items": {"type": "string"}}}, ["seats"]),
    _fn("parse_iso_datetime",     "Parse an ISO-8601 datetime.", {"value": {"type": "string"}}, ["value"]),
    _fn("get_business_hours",     "Return business hours for a region.",
        {"region": {"type": "string", "enum": ["us", "eu", "apac"]}}, ["region"]),
    _fn("calculate_tax_estimate", "Estimate tax.",
        {"amount": {"type": "number"}, "region": {"type": "string"}}, ["amount", "region"]),
    _fn("search_product_catalog", "Search a product catalog.", {"query": {"type": "string"}}, ["query"]),
    _fn("get_refund_policy",      "Return a refund policy.", {"category": {"type": "string"}}, ["category"]),
    _fn("classify_sentiment",     "Classify text sentiment.", {"text": {"type": "string"}}, ["text"]),
    _fn("detect_language",        "Detect the language of a text snippet.", {"text": {"type": "string"}}, ["text"]),
    _fn("redact_pii",             "Redact emails and phone numbers from text.", {"text": {"type": "string"}}, ["text"]),
    _fn("schedule_follow_up",     "Schedule a follow-up.",
        {"days_from_now": {"type": "integer"}, "topic": {"type": "string"}}, ["days_from_now", "topic"]),
]

PRICING = {
    "claude-haiku-4-5-20251001": 0.80,
    "claude-haiku-4-5":          0.80,
    "claude-sonnet-4-6":         3.00,
    "claude-opus-4-7":           5.00,
    "claude-3-5-haiku-20241022": 0.80,
    "gpt-4o":                    2.50,
    "gpt-4o-mini":               0.15,
}


# ── API call ──────────────────────────────────────────────────────────────────

def _anthropic_call(base_url, api_key, model, prompt, tools, label, timeout):
    print(f"Running {label} ({base_url.replace('https://', '')})...", end="  ", flush=True)
    t0 = time.perf_counter()
    with httpx.Client(timeout=timeout) as client:
        resp = client.post(
            f"{base_url.rstrip('/')}/v1/messages",
            headers={"x-api-key": api_key, "anthropic-version": "2023-06-01",
                     "content-type": "application/json"},
            json={"model": model, "max_tokens": 512, "tools": tools,
                  "messages": [{"role": "user", "content": prompt}]},
        )
    elapsed_ms = (time.perf_counter() - t0) * 1000
    if resp.status_code >= 400:
        raise SystemExit(f"HTTP {resp.status_code}: {resp.text[:400]}")
    print(f"done  {elapsed_ms:,.0f}ms")
    return resp.json(), {k.lower(): v for k, v in resp.headers.items()}, elapsed_ms


def _extract_tool(data):
    for b in data.get("content", []):
        if isinstance(b, dict) and b.get("type") == "tool_use":
            return b.get("name")
    return None

def _input_tokens(data):  return int(data.get("usage", {}).get("input_tokens") or 0)
def _output_tokens(data): return int(data.get("usage", {}).get("output_tokens") or 0)
def _kind(data):          return "tool_call" if _extract_tool(data) else "text"


# ── Report ────────────────────────────────────────────────────────────────────

def _print_report(prompt, model, d_data, o_data, o_hdrs):
    d_tok, o_tok   = _input_tokens(d_data), _input_tokens(o_data)
    d_out, o_out   = _output_tokens(d_data), _output_tokens(o_data)
    d_kind, o_kind = _kind(d_data), _kind(o_data)
    d_tool, o_tool = _extract_tool(d_data) or "none", _extract_tool(o_data) or "none"

    tools_out = o_hdrs.get("x-orqen-tools-output", "?")
    routing   = o_hdrs.get("x-orqen-routing", "")
    overhead  = o_hdrs.get("x-orqen-pipeline-ms", "<20")
    match_sym = "\u2713" if d_tool == o_tool else "\u2717"

    in_delta = o_tok - d_tok  # negative when Orqen sent fewer input tokens
    pct = f"{in_delta / d_tok * 100:+.0f}%" if d_tok else "n/a"
    out_delta = o_out - d_out

    print()
    print(f"Tools available: {len(TOOLS)}")
    print(f"Model:           {model}")
    print(f"Prompt:          {prompt[:80]}{'...' if len(prompt) > 80 else ''}")
    print()
    W = 50
    sep = "-" * W
    print(sep)
    print(f"{'':20}{'DIRECT':>9}  {'VIA ORQEN':>9}  {'DELTA':>9}")
    print(sep)
    print(f"{'Input tokens':20}{d_tok:>9,}  {o_tok:>9,}  {in_delta:+,} ({pct})")
    print(f"{'Output tokens':20}{d_out:>9,}  {o_out:>9,}  {out_delta:>+9,}")
    print(f"{'Tools forwarded':20}{len(TOOLS):>9}  {tools_out:>9}")
    if routing:
        print(f"{'Routing method':20}{'--':>9}  {routing:>9}")
    print(f"{'Orqen overhead':20}{'--':>9}  {overhead + 'ms':>9}")
    print(sep)
    print(f"Answer match: {match_sym}")
    print(f"Response kind:  direct={d_kind}  orqen={o_kind}")

    if d_tool != o_tool or (d_kind == "tool_call" and o_kind == "text"):
        print()
        if d_kind == "tool_call" and o_kind == "text":
            print("WARNING: recall miss — needed tool was pruned, model answered in prose.")
            print(f"  Fix: improve the description of '{d_tool}' so it ranks into the forwarded set.")
        else:
            print(f"WARNING: tool mismatch — direct={d_tool}, orqen={o_tool}.")
            print("  Compare answer correctness before reading the token delta.")

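    # Exact match first, then the longest key contained in the model name,
    # so that "gpt-4o-mini" is not priced as "gpt-4o".
    price = PRICING.get(model) or next(
        (PRICING[k] for k in sorted(PRICING, key=len, reverse=True) if k in model), None)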
    if price and d_tok:
        d_cost = d_tok * price / 1_000_000
        o_cost = o_tok * price / 1_000_000
        saving = d_cost - o_cost
        print()
        print("At " + model + " pricing ($" + f"{price:.2f}" + "/M input):")
        print("  Direct:     $" + f"{d_cost:.5f}" + " per call")
        print("  Via Orqen:  $" + f"{o_cost:.5f}" + " per call")
        print("  Savings:    $" + f"{saving:.5f}" + " per call  (~$" + f"{saving * 1000:.2f}" + " per 1,000 calls)")
    print()


# ── Main ──────────────────────────────────────────────────────────────────────

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--prompt",     default="What is the current weather in Paris, France? Should I bring an umbrella today?")
    p.add_argument("--model",      default=os.environ.get("BENCHMARK_MODEL", "claude-haiku-4-5-20251001"))
    p.add_argument("--orqen-url",  default="https://api.orqen.app")
    p.add_argument("--direct-url", default="https://api.anthropic.com")
    p.add_argument("--timeout",    type=float, default=120.0)
    args = p.parse_args()

    orqen_key  = os.environ.get("ORQEN_API_KEY")  or raise_("Set ORQEN_API_KEY — free account at orqen.app")
    direct_key = os.environ.get("ANTHROPIC_API_KEY") or raise_("Set ANTHROPIC_API_KEY for the direct baseline")

    d_data, _,      d_ms = _anthropic_call(args.direct_url, direct_key, args.model, args.prompt, TOOLS, "direct",    args.timeout)
    o_data, o_hdrs, o_ms = _anthropic_call(args.orqen_url,  orqen_key,  args.model, args.prompt, TOOLS, "via Orqen", args.timeout)
    _print_report(args.prompt, args.model, d_data, o_data, o_hdrs)

def raise_(msg): raise SystemExit(msg)

if __name__ == "__main__":
    main()