2026-04-25 · Architecture changelog

Toolcaller Refactor

3-stage state machine → single tool-calling loop. Production now runs on GLM-5.1 at ~$59 per 1,000 conversations — about half of what cached Claude Sonnet would cost on the same architecture.

TL;DR for the business side: The agent's internal architecture changed today. Customer experience is at least as good as before (replies are slightly more natural, dispatch logic is unchanged). Cost per conversation on the new architecture is roughly half of what cached Claude Sonnet would cost — this is a model-choice comparison, not a before/after of the old 3-stage bot. From an operations standpoint: there's now one prompt to write per project instead of three, and the model decides when to act rather than a hand-tuned heuristic.

Before vs. now

Before — 3 stages

User message arrives
   ↓
Stage 1
  rewrite query (LLM)
  3× knowledge-base search
  URL-score counter
  URL-determination LLM call
  reply (Prompt1)
   ↓
Stage 2
  fetch service docs
  extract price (LLM)
  3× parallel extractions
  reply (Prompt2)
   ↓
Stage 3
  generate summary (LLM)
  closing reply (Prompt3)
  send to CRM

4 prompts per project, hand-coded stage transitions, up to 7 LLM calls per turn.

Now — one tool-using loop

User message arrives
   ↓
Single LLM loop (max 8 iterations)
  → may search the KB (hybrid_search)
  → may save what it learned (set_state)
  → may dispatch the lead (send_lead)
  → ends with a reply

Tools:
  hybrid_search, set_state,
  get_state, send_lead

The model itself decides
when to call which tool.

1 prompt per project, model owns every decision, 2–4 LLM calls per turn typical.
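The loop above can be sketched as follows. This is a hedged illustration, not the engine's actual code: `llm.complete`, the response shape, and the tool registry are all hypothetical names.

```python
# Hedged sketch of the single tool-calling loop; names are illustrative.

MAX_ITERATIONS = 8  # matches the cap described above

def run_turn(llm, tools, messages):
    """One customer turn: iterate until the model answers or the cap is hit."""
    for _ in range(MAX_ITERATIONS):
        response = llm.complete(messages, tools=tools)
        if response.tool_call is None:
            return response.text  # the model chose to reply to the customer
        # Execute the requested tool (hybrid_search / set_state / get_state /
        # send_lead) and feed the result back for the next iteration.
        handler = tools[response.tool_call.name]
        result = handler(**response.tool_call.args)
        messages.append({"role": "tool",
                        "name": response.tool_call.name,
                        "content": result})
    # Safety valve: never loop forever if the model keeps calling tools.
    return "A specialist will follow up with you shortly."
```

Typical turns finish in 2–4 iterations (one or two tool calls plus the reply); the cap of 8 is only the hard stop.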

Customer-facing impact

Customers should not notice a regression; if anything, replies are slightly more natural, and dispatch logic is unchanged.

One known edge case: on rare turns a smaller open-source model can acknowledge the customer's contact but skip actually dispatching the lead. The follow-up timer catches these silently and dispatches the lead later. Not observed in production with GLM-5.1, the current production model.
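The safety net described above can be sketched like this; all names are illustrative, not the production implementation. If the model saved a contact via set_state but never called send_lead this turn, a timer dispatches the lead later.

```python
# Hedged sketch of the follow-up safety net (all names illustrative).

FOLLOW_UP_DELAY_S = 15 * 60  # illustrative delay, not the production value

def arm_follow_up(state, lead_dispatched, schedule, send_lead):
    """Arm a silent dispatch timer when a contact was captured but the
    lead was not sent this turn. Returns True if a timer was scheduled."""
    if state.get("contact") and not lead_dispatched:
        schedule(FOLLOW_UP_DELAY_S, send_lead, state)
        return True
    return False
```

The customer sees nothing; the lead simply arrives in the CRM a little later than usual.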

Operations: how to write Prompt1

The single Prompt1 now drives the agent's behavior. The engine has built-in knowledge of the four tools (search, save, get-state, dispatch); the project prompt's job is to tell the model what the business does, what to gather from the customer, and how to behave.

Recommended structure (~3,500–4,500 chars)

Identity & tone (1–2 paragraphs)

WHAT TO GATHER (numbered list of fields)
  1) service type
  2) object
  3) city
  4) contact (phone or email) — required before dispatch
  5) documents — required / nice-to-have
  6) any other business-specific fields

DIALOG LOGIC (numbered steps)
  1. Identify the service type first.
  2. Once known, briefly explain value, give a price ballpark from KB.
  3. Ask for documents.
  4. Ask for contact.
  5. Dispatch when [criteria].
  6. After dispatch — short closing message.

WHEN TO DISPATCH
  - Minimum: contact + 3 key params.
  - Or: customer explicitly asks for a human.

DON'T-DOS
  - Don't quote prices not in the knowledge base.
  - Don't promise specific dates or handle payment.
  - Don't invent manager names.

How the engine maps natural language to tool calls

You write business-level instructions in plain language. The engine translates phrases into the right tool calls automatically:

Natural-language phrase in Prompt1 → engine maps to:

  "hand off to a specialist", "transfer the lead", "dispatch" → send_lead
  "consult the knowledge base", "rely on facts", "check the site" → hybrid_search
  "remember", "save", "note", "capture" → set_state
  "city", "budget", "deadline", "service type" (anything in WHAT TO GATHER) → set_state(extras=…) with stable English keys

You do not need to write technical instructions like "call set_state with {contact: {method: 'phone', value: ...}}" in your prompt. The engine handles all of that. Mentioning tool names or argument schemas in Prompt1 actually hurts — it confuses the model when your wording competes with the engine's.
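To make the division of labor concrete, here is a hypothetical sketch of the tool declarations the engine registers with the model; the schemas are illustrative, not the real ones. Prompt1 never mentions any of this: the model links "transfer the lead" in your prompt to send_lead purely through the tool's own description.

```python
# Hypothetical tool declarations owned by the engine (schemas illustrative).
# Prompt1 stays in plain business language; these live in the engine only.

BUILT_IN_TOOLS = [
    {"name": "hybrid_search",
     "description": "Search the project knowledge base, e.g. for prices.",
     "parameters": {"query": "string"}},
    {"name": "set_state",
     "description": "Save a fact learned from the customer "
                    "(contact, city, service type, extras).",
     "parameters": {"extras": "object"}},
    {"name": "get_state",
     "description": "Read back what has been saved so far.",
     "parameters": {}},
    {"name": "send_lead",
     "description": "Dispatch the gathered lead to the CRM.",
     "parameters": {"summary": "string"}},
]
```

This is why duplicating tool names in Prompt1 backfires: the model already has one authoritative description per tool, and a second competing description in the prompt only adds noise.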

Common Prompt1 mistakes

Mentioning tool names or argument schemas Engine handles these. Use plain business language: "transfer the lead", not "call send_lead(summary)".
Vague dispatch criteria Be specific: "Minimum: contact + service type + object + city". Don't say "when you have enough info".
Putting price tables inline Prices live in the knowledge base. Use price_url to point at the price page; the agent will look up specific prices via hybrid_search when asked.
Length over signal Prompts > 5,000 chars start hurting attention. Aim for 3,500–4,500.
Forgetting the start-reply for new clients The /start command's first message comes from the project's start-reply field, not Prompt1.

Testing changes

Update Prompt1 directly from the project's Telegram lead group with /prompt1 <text>. No bot restart needed. The full command reference is in the operations guide. After updating, just walk a representative scenario with the bot in chat and check whether it gathers the right fields and dispatches the lead at the right moment.

What changed in the codebase

Added

Removed

Refactored

Pricing impact

On the new tool-calling architecture, GLM-5.1 (production, ~$0.059/conv) costs about half of cached Claude Sonnet (~$0.118/conv) at the same or better quality. To be clear: this is a model comparison on the new architecture, not a measurement of "new vs. old 3-stage bot" — the old system used a different model and different prompt layout, so a direct $/conv comparison wouldn't be apples-to-apples. See the pricing investigation for the full comparison.
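The headline numbers work out as follows; a trivial check using the per-conversation figures quoted above.

```python
# Sanity check on the quoted pricing: $59 per 1,000 conversations for GLM-5.1
# versus ~$0.118/conv for cached Claude Sonnet on the same architecture.
glm_per_conv = 59 / 1000        # $0.059 per conversation
sonnet_per_conv = 0.118         # cached Claude Sonnet, same tool-calling loop
ratio = glm_per_conv / sonnet_per_conv  # ~0.5, i.e. "about half"
```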