openrouter/z-ai/glm-5.1 wins on the price/reliability/quality balance and is deployed in production at $0.059 per conversation. Caching pass-through via OpenRouter works for the GLM and DeepSeek-direct routes (~66% hit rate). DeepSeek via :nitro failed: empty-choices errors mid-loop and no caching. GLM-4.6 occasionally (~1 in 5 runs) acknowledges the contact but doesn't dispatch the lead in-conversation; the follow-up timer recovers it, so this is observed behavior rather than a blocker.
Methodology
- One conversation pattern reproduced across all candidates: client asks for a property appraisal → narrows to inheritance + Barnaul → asks document questions → uploads passport → shares phone → bot dispatches the lead.
- Each session yields ~50K–90K input tokens, ~1–2K output tokens, 10–15 LLM round-trips.
- Token counts are exact — pulled from the API's `usage` field, written into the trace as `usage: in=N out=N cached_read=N cache_write=N` lines.
- Costs are exact for OpenRouter routes — pulled from `usage.cost` in the chat-completion response. Direct OpenAI/Anthropic costs are computed deterministically from token counts.
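The per-session accounting reduces to summing those trace lines. A minimal sketch of such a parser, assuming the `usage:` line format described above (the regex and `sum_trace` helper are illustrative, not the actual tooling):

```python
import re

# Matches trace lines of the form described above, e.g.:
#   usage: in=51234 out=812 cached_read=33792 cache_write=0 cost=$0.0108
# The cost= suffix is present only on routes that return usage.cost.
USAGE_RE = re.compile(
    r"^usage: in=(\d+) out=(\d+) cached_read=(\d+) cache_write=(\d+)"
    r"(?: cost=\$([0-9.]+))?"
)

def sum_trace(lines):
    """Sum token counts and (when present) exact per-call cost over a trace."""
    totals = {"in": 0, "out": 0, "cached_read": 0, "cache_write": 0, "cost": 0.0}
    for line in lines:
        m = USAGE_RE.match(line)
        if not m:
            continue  # skip non-usage trace lines
        totals["in"] += int(m.group(1))
        totals["out"] += int(m.group(2))
        totals["cached_read"] += int(m.group(3))
        totals["cache_write"] += int(m.group(4))
        if m.group(5):
            totals["cost"] += float(m.group(5))
    return totals
```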
Results
| Model | $/conv | Per 1k | Lead reliability | Reasoning | Notes |
|---|---|---|---|---|---|
| openrouter/z-ai/glm-5.1 (production) | $0.059 | $59 | clean (1/1) | very good (KB-grounded) | Caching passes through OpenRouter, ~66% hit rate |
| openrouter/z-ai/glm-4.6 | $0.012–0.013 | $13 | observed: ~1 in 5 runs misses send_lead in-conversation; timer recovery covers it | very good (slightly behind 5.1 / GPT-5 in our flow) | Cheaper than 5.1; lead-loss is recoverable, not blocking |
| openai/gpt-5.4 | $0.063 | $63 | clean (1/1) | very good | OpenAI auto-caching; cache reads at 10% of input rate |
| anthropic/claude-sonnet-4.6 (cached) | $0.118 | $118 | clean (1/1) | very good | Explicit cache_control markers; 90% off cached reads |
| openrouter/deepseek/deepseek-v4-pro:nitro | flaky | n/a | testing failed | n/a | Empty-choices errors mid-loop; no caching pass-through |
Caching findings
| Provider | Mechanism | Discount | OpenRouter pass-through |
|---|---|---|---|
| OpenAI (GPT-5 family) | automatic | 90% | yes |
| OpenAI (GPT-4.1 / 4o) | automatic | 75% / 50% | yes |
| Anthropic | explicit cache_control | 90% | yes (markers passed through) |
| DeepSeek | automatic | ~90% | only via DeepSeek's own API; not via Together/Fireworks |
| Z.AI GLM | automatic | ~80% | yes (~66% effective rate observed for GLM-5.1) |
| Moonshot Kimi / Qwen | varies | unknown | undocumented |
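To make the discount column concrete, the effective input cost of one auto-cached call can be worked out as follows (the rate and token counts below are illustrative, not our actual pricing):

```python
def effective_input_cost(total_in, cached_read, rate_per_mtok, cache_discount):
    """Input cost when `cached_read` of `total_in` tokens hit the cache.

    cache_discount is the fraction knocked off the rate for cached reads,
    e.g. 0.90 for the OpenAI GPT-5-family automatic caching row above.
    """
    fresh = total_in - cached_read
    cached_rate = rate_per_mtok * (1 - cache_discount)
    return (fresh * rate_per_mtok + cached_read * cached_rate) / 1_000_000

# Illustrative numbers: 60K input tokens, 40K served from cache,
# $1.25/MTok input rate, 90% discount on cached reads.
cost = effective_input_cost(60_000, 40_000, 1.25, 0.90)  # ≈ $0.03 vs $0.075 uncached
```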
For our agent, the stable cacheable system prefix is ~3,500–4,300 tokens (engine preamble + project Prompt1 + tools schema). The volatile suffix (per-turn state snapshot) doesn't cache, by design — it's split into a separate system block so its variation doesn't bust the prefix.
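That split can be sketched as message construction. This is a hedged sketch, not the agent's actual code: the `cache_control` marker is Anthropic's documented explicit-caching mechanism and only matters on routes that honor it; the variable names and placeholder strings are illustrative:

```python
# Illustrative placeholders for the two system blocks described above.
STABLE_PREFIX = "engine preamble + project Prompt1 + tools schema ..."  # ~3.5-4.3K tokens, identical every call
VOLATILE_STATE = "per-turn state snapshot ..."                          # changes every turn

def build_messages(user_text):
    """Keep the cacheable prefix in its own system block so the volatile
    per-turn snapshot never busts the cached prefix."""
    return [
        {
            "role": "system",
            "content": [{
                "type": "text",
                "text": STABLE_PREFIX,
                # Anthropic-style explicit marker; automatic-caching providers
                # (OpenAI, GLM, DeepSeek) ignore it and cache the prefix on their own.
                "cache_control": {"type": "ephemeral"},
            }],
        },
        {"role": "system", "content": VOLATILE_STATE},  # uncached by design
        {"role": "user", "content": user_text},
    ]
```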
OpenRouter cache-miss pattern
Caching pass-through via OpenRouter is non-deterministic because OpenRouter load-balances across multiple upstreams (Together, Fireworks, DeepInfra, the model author's own API). Different upstreams have different caches; if a call lands on a different upstream than the previous one, the prefix is cold there and pays full rate.
Observed in production GLM-5.1 traces: typically one cache miss per session (cost variance ~17%). Net effect: paying $59/1k vs the cache-perfect floor of ~$50/1k. Acceptable for now; addressable later via OpenRouter's provider.order request field if it becomes a pattern.
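If pinning ever becomes necessary, the fix would look roughly like this. A sketch using OpenRouter's documented `provider` routing object; the provider slug and model name are placeholders to check against OpenRouter's provider list, not verified values:

```python
# Pin routing to one upstream so the cached prefix stays warm there,
# instead of load-balancing across upstreams with separate caches.
request = {
    "model": "z-ai/glm-5.1",      # placeholder model slug
    "provider": {
        "order": ["z-ai"],        # placeholder provider slug: try this upstream first
        "allow_fallbacks": True,  # still fail over if the pinned upstream is down
    },
    "messages": [...],            # built as usual
}
```

The trade-off: pinning trades some availability for cache locality, which is why it's deferred until the miss pattern justifies it.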
Per-turn cost characteristics (production GLM-5.1, 6-turn session)
| turn | user msg | $/turn |
|---|---|---|
| 1 | "need an apartment appraisal" | $0.0108 |
| 2 | "Barnaul, handling an inheritance" | $0.0102 |
| 3 | "and where do I get the EGRN extract?" | $0.0105 |
| 4 | `<file: passport.pdf>` | $0.0000 (file_ack) |
| 5 | "passport" | $0.0120 |
| 6 | "got it. yes, the number is current" → send_lead | $0.0157 |
| | TOTAL | $0.0591 |
Reliability observations
- GPT-5.4 / Sonnet — both 100% reliable in tested sessions. Cleanly call `send_lead` on the turn when contact + 3 key params are first satisfied.
- GLM-5.1 — 100% reliable in 2/2 tested sessions. Clean dispatch. Reasoning quality matches GPT-5.4.
- GLM-4.6 — observed in roughly 1 of 5 runs: the bot gets the contact, says "we'll call you", but doesn't invoke `send_lead` in-conversation. The follow-up timer picks it up (cold + last interval dispatches the lead automatically), so this is recoverable behavior rather than a blocker. Worth knowing about, not a reason to avoid the model on its own.
- DeepSeek via `:nitro` — empty-choices errors in mid-loop calls. Even the new transient-error retry didn't always help.
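The transient-error retry mentioned above can be sketched roughly like this (a minimal sketch; the attempt count, backoff numbers, and helper names are illustrative, not the agent's actual implementation):

```python
import time

class EmptyChoicesError(RuntimeError):
    """Raised when the upstream keeps returning responses with no choices."""

def call_with_retry(call, attempts=3, base_delay=0.5):
    """Retry a chat-completion call on empty-choices responses.

    `call` is any zero-arg function returning a parsed response dict.
    """
    for attempt in range(attempts):
        resp = call()
        if resp.get("choices"):
            return resp
        if attempt < attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # simple exponential backoff
    raise EmptyChoicesError("upstream returned empty choices after retries")
```

As the `:nitro` results show, a retry like this only helps when some upstream attempt eventually succeeds; it can't rescue a route that is consistently returning empty choices.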
Recommendations
| Use case | Model |
|---|---|
| Production default | openrouter/z-ai/glm-5.1 |
| Premium tier (paying ~7×) | anthropic/claude-sonnet-4.6 with caching |
| Mid-tier alternative | openai/gpt-5.4 |
| Cheaper alternative | openrouter/z-ai/glm-4.6 — about 5× cheaper than 5.1, with the in-conversation lead-loss caveat above (timer recovery handles it) |
| Avoid | any :nitro variant (no caching pass-through) |
Reproducibility
To re-run this investigation:
```shell
# Edit config.dev.json's models.toolcaller, then walk a 5-turn session via CLI:
bin/aichat-cli --config config.dev.json --project abo22

# Sum cost from the trace:
grep '^usage:' traces/1/<latest>.log | sed 's/.*cost=\$\([0-9.]*\).*/\1/' | paste -sd+ | bc
```
The `usage:` and `cost=` lines are written by `internal/agent/toolagent.go` (debugDumpBlocks + the per-call usage line). Token counts are exact; OpenRouter `cost=` is exact for routes that return it.