openrouter/z-ai/glm-5.1 wins on the price/reliability/quality balance and is deployed in production at $0.059 per conversation. Caching pass-through via OpenRouter works for the GLM and DeepSeek-direct routes (~66% hit rate). DeepSeek via :nitro failed: empty-choices errors mid-loop and no caching. GLM-4.6 occasionally (~1 in 5 runs) acknowledges the contact but doesn't dispatch the lead in-conversation; the follow-up timer recovers it, so this is observed behavior rather than a blocker.
Methodology
- One conversation pattern reproduced across all candidates: client asks for a property appraisal → narrows to inheritance + Barnaul → asks document questions → uploads passport → shares phone → bot dispatches the lead.
- Each session yields ~50K–90K input tokens, ~1–2K output tokens, 10–15 LLM round-trips.
- Token counts are exact — pulled from the API's `usage` field, written into the trace as `usage: in=N out=N cached_read=N cache_write=N` lines.
- Costs are exact for OpenRouter routes — pulled from `usage.cost` in the chat-completion response. Direct OpenAI/Anthropic costs are computed deterministically from token counts.
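The per-session accounting reduces to summing those trace lines. A minimal sketch of such a parser, assuming the `usage:` line format described above (the regex and `sum_trace` helper are illustrative, not the actual tooling):

```python
import re

# Matches trace lines of the form described above, e.g.:
#   usage: in=51234 out=812 cached_read=33792 cache_write=0 cost=$0.0108
# The cost= suffix is present only on routes that return usage.cost.
USAGE_RE = re.compile(
    r"^usage: in=(\d+) out=(\d+) cached_read=(\d+) cache_write=(\d+)"
    r"(?: cost=\$([0-9.]+))?"
)

def sum_trace(lines):
    """Sum token counts and (when present) exact per-call cost over a trace."""
    totals = {"in": 0, "out": 0, "cached_read": 0, "cache_write": 0, "cost": 0.0}
    for line in lines:
        m = USAGE_RE.match(line)
        if not m:
            continue  # skip non-usage trace lines
        totals["in"] += int(m.group(1))
        totals["out"] += int(m.group(2))
        totals["cached_read"] += int(m.group(3))
        totals["cache_write"] += int(m.group(4))
        if m.group(5):
            totals["cost"] += float(m.group(5))
    return totals
```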
Results
| Model | $/conv | Per 1k | Lead reliability | Reasoning | Notes |
|---|---|---|---|---|---|
| openrouter/z-ai/glm-5.1 (production) | $0.059 | $59 | clean (1/1) | very good (KB-grounded) | Caching passes through OpenRouter, ~66% hit rate |
| openrouter/z-ai/glm-4.6 | $0.012–0.013 | $13 | observed: ~1 in 5 runs misses send_lead in-conversation; timer recovery covers it | very good (slightly behind 5.1 / GPT-5 in our flow) | Cheaper than 5.1; lead-loss is recoverable, not blocking |
| openai/gpt-5.4 | $0.063 | $63 | clean (1/1) | very good | OpenAI auto-caching; cache reads at 10% of input rate |
| anthropic/claude-sonnet-4.6 (cached) | $0.118 | $118 | clean (1/1) | very good | Explicit cache_control markers; 90% off cached reads |
| openrouter/deepseek/deepseek-v4-pro:nitro | flaky | n/a | testing failed | n/a | Empty-choices errors mid-loop; no caching pass-through |
Caching findings
| Provider | Mechanism | Discount | OpenRouter pass-through |
|---|---|---|---|
| OpenAI (GPT-5 family) | automatic | 90% | yes |
| OpenAI (GPT-4.1 / 4o) | automatic | 75% / 50% | yes |
| Anthropic | explicit cache_control | 90% | yes (markers passed through) |
| DeepSeek | automatic | ~90% | only via DeepSeek's own API; not via Together/Fireworks |
| Z.AI GLM | automatic | ~80% | yes (~66% effective rate observed for GLM-5.1) |
| Moonshot Kimi / Qwen | varies | unknown | undocumented |
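To make the discount column concrete, the effective input cost of one auto-cached call can be worked out as follows (the rate and token counts below are illustrative, not our actual pricing):

```python
def effective_input_cost(total_in, cached_read, rate_per_mtok, cache_discount):
    """Input cost when `cached_read` of `total_in` tokens hit the cache.

    cache_discount is the fraction knocked off the rate for cached reads,
    e.g. 0.90 for the OpenAI GPT-5-family automatic caching row above.
    """
    fresh = total_in - cached_read
    cached_rate = rate_per_mtok * (1 - cache_discount)
    return (fresh * rate_per_mtok + cached_read * cached_rate) / 1_000_000

# Illustrative numbers: 60K input tokens, 40K served from cache,
# $1.25/MTok input rate, 90% discount on cached reads.
cost = effective_input_cost(60_000, 40_000, 1.25, 0.90)  # ≈ $0.03 vs $0.075 uncached
```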
For our agent, the stable cacheable system prefix is ~3,500–4,300 tokens (engine preamble + project Prompt1 + tools schema). The volatile suffix (per-turn state snapshot) doesn't cache, by design — it's split into a separate system block so its variation doesn't bust the prefix.
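That split can be sketched as message construction. This is a hedged sketch, not the agent's actual code: the `cache_control` marker is Anthropic's documented explicit-caching mechanism and only matters on routes that honor it; the variable names and placeholder strings are illustrative:

```python
# Illustrative placeholders for the two system blocks described above.
STABLE_PREFIX = "engine preamble + project Prompt1 + tools schema ..."  # ~3.5-4.3K tokens, identical every call
VOLATILE_STATE = "per-turn state snapshot ..."                          # changes every turn

def build_messages(user_text):
    """Keep the cacheable prefix in its own system block so the volatile
    per-turn snapshot never busts the cached prefix."""
    return [
        {
            "role": "system",
            "content": [{
                "type": "text",
                "text": STABLE_PREFIX,
                # Anthropic-style explicit marker; automatic-caching providers
                # (OpenAI, GLM, DeepSeek) ignore it and cache the prefix on their own.
                "cache_control": {"type": "ephemeral"},
            }],
        },
        {"role": "system", "content": VOLATILE_STATE},  # uncached by design
        {"role": "user", "content": user_text},
    ]
```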
OpenRouter cache-miss pattern
Caching pass-through via OpenRouter is non-deterministic because OpenRouter load-balances across multiple upstreams (Together, Fireworks, DeepInfra, the model author's own API). Different upstreams have different caches; if a call lands on a different upstream than the previous one, the prefix is cold there and pays full rate.
Observed in production GLM-5.1 traces: typically one cache miss per session (cost variance ~17%). Net effect: paying $59/1k vs the cache-perfect floor of ~$50/1k. Acceptable for now; addressable later via OpenRouter's provider.order request field if it becomes a pattern.
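If pinning ever becomes necessary, the fix would look roughly like this. A sketch using OpenRouter's documented `provider` routing object; the provider slug and model name are placeholders to check against OpenRouter's provider list, not verified values:

```python
# Pin routing to one upstream so the cached prefix stays warm there,
# instead of load-balancing across upstreams with separate caches.
request = {
    "model": "z-ai/glm-5.1",      # placeholder model slug
    "provider": {
        "order": ["z-ai"],        # placeholder provider slug: try this upstream first
        "allow_fallbacks": True,  # still fail over if the pinned upstream is down
    },
    "messages": [...],            # built as usual
}
```

The trade-off: pinning trades some availability for cache locality, which is why it's deferred until the miss pattern justifies it.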
Per-turn cost characteristics (production GLM-5.1, 6-turn session)
| turn | user msg | $/turn |
|---|---|---|
| 1 | "need an apartment appraisal" | $0.0108 |
| 2 | "Barnaul, handling an inheritance" | $0.0102 |
| 3 | "and where do I get the EGRN extract?" | $0.0105 |
| 4 | `<file: passport.pdf>` | $0.0000 (file_ack) |
| 5 | "passport" | $0.0120 |
| 6 | "got it. yes, the number is current" → send_lead | $0.0157 |
| | TOTAL | $0.0591 |
Reliability observations
- GPT-5.4 / Sonnet — both 100% reliable in tested sessions. Cleanly call `send_lead` on the turn when contact + 3 key params are first satisfied.
- GLM-5.1 — 100% reliable in 2/2 tested sessions. Clean dispatch. Reasoning quality matches GPT-5.4.
- GLM-4.6 — observed in roughly 1 of 5 runs: the bot gets the contact, says "we'll call you", but doesn't invoke `send_lead` in-conversation. The follow-up timer picks it up (cold + last interval dispatches the lead automatically), so this is recoverable behavior rather than a blocker. Worth knowing about, not a reason to avoid the model on its own.
- DeepSeek via `:nitro` — empty-choices errors in mid-loop calls. Even the new transient-error retry didn't always help.
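The transient-error retry mentioned above can be sketched roughly like this (a minimal sketch; the attempt count, backoff numbers, and helper names are illustrative, not the agent's actual implementation):

```python
import time

class EmptyChoicesError(RuntimeError):
    """Raised when the upstream keeps returning responses with no choices."""

def call_with_retry(call, attempts=3, base_delay=0.5):
    """Retry a chat-completion call on empty-choices responses.

    `call` is any zero-arg function returning a parsed response dict.
    """
    for attempt in range(attempts):
        resp = call()
        if resp.get("choices"):
            return resp
        if attempt < attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # simple exponential backoff
    raise EmptyChoicesError("upstream returned empty choices after retries")
```

As the `:nitro` results show, a retry like this only helps when some upstream attempt eventually succeeds; it can't rescue a route that is consistently returning empty choices.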
Recommendations
| Use case | Model |
|---|---|
| Production default | openrouter/z-ai/glm-5.1 |
| Premium tier (paying ~7×) | anthropic/claude-sonnet-4.6 with caching |
| Mid-tier alternative | openai/gpt-5.4 |
| Cheaper alternative | openrouter/z-ai/glm-4.6 — about 5× cheaper than 5.1, with the in-conversation lead-loss caveat above (timer recovery handles it) |
| Avoid | any :nitro variant (no caching pass-through) |
Reproducibility
To re-run this investigation:
```shell
# Edit config.dev.json's models.toolcaller, then walk a 5-turn session via CLI:
bin/aichat-cli --config config.dev.json --project abo22

# Sum cost from the trace:
grep '^usage:' traces/1/<latest>.log | sed 's/.*cost=\$\([0-9.]*\).*/\1/' | paste -sd+ | bc
```
The `usage:` and `cost=` lines are written by `internal/agent/toolagent.go` (debugDumpBlocks + the per-call usage line). Token counts are exact; OpenRouter `cost=` is exact for routes that return it.