Spend less, ship faster, without turning your product into Mad Libs. This is our field guide to measuring, reducing, and routing LLM usage so you can scale features and margins together.
- Measure first. Cost = Σ(in_tokens × in_price + out_tokens × out_price). Instrument per request and per feature.
- Trim tokens. Shorten system prompts, chunk RAG smaller, drop baggage (logs/IDs), use structured outputs.
- Cache everything. Exact + fuzzy prompt caches, embedding reuse, pre-computed tool calls, deterministic summaries.
- Route smartly. 70–90% of traffic can go to smaller/cheaper models; escalate only when needed.
- Batch & schedule. Night jobs and bulk embeddings save 2–5× on throughput.
- Guardrails & evals. Savings stick when quality is measured and enforced.
1) Instrumentation: know your tokens
Track these per request:
- input_tokens / output_tokens
- model, temperature, top_p, max_tokens
- latency_ms, cache_hit (bool, and cache layer)
- feature (business dimension), tenant, locale
- quality scores (auto-eval or human)
Cost formula
cost = input_tokens * price_in + output_tokens * price_out
Put the formula in a shared sheet + a dashboard. If the team can’t see costs by feature, nobody owns them.
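A minimal sketch of that per-request record in TypeScript, with made-up model names and prices standing in for your provider's real rates:

// Hypothetical per-token prices (USD); swap in your provider's real rates.
const PRICES: Record<string, { in: number; out: number }> = {
  "small-model": { in: 0.15e-6, out: 0.6e-6 },
  "large-model": { in: 2.5e-6, out: 10e-6 },
};

interface RequestLog {
  feature: string;
  tenant: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  cacheHit: boolean;
  costUsd: number;
}

// cost = input_tokens * price_in + output_tokens * price_out
function costUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  return inputTokens * price.in + outputTokens * price.out;
}

function recordRequest(log: Omit<RequestLog, "costUsd">): RequestLog {
  const entry = { ...log, costUsd: costUsd(log.model, log.inputTokens, log.outputTokens) };
  console.log(JSON.stringify(entry)); // replace with your metrics pipeline
  return entry;
}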
2) Token hygiene: fewer in, fewer out
System prompt diet
- Replace paragraphs with bullets; pull policy snippets via RAG at runtime.
- Avoid repeating boilerplate across calls; centralize it and reference it.
- Declare output schema (JSON/YAML) to cap verbose answers.
User/context trimming
- Drop IDs, timestamps, or logs not used by the model.
- Clip long threads using recency + salience (keep the last N + top K).
- Summarize earlier steps into compact state objects.
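A sketch of the recency + salience clip, assuming you attach your own salience score (e.g., embedding similarity to the current query) to each message:

// salience: your own score per message, e.g. similarity to the current query.
interface ThreadMessage { role: string; content: string; salience: number }

// Keep the last `recent` messages plus the `topK` most salient earlier ones, in original order.
function clipThread(messages: ThreadMessage[], recent = 6, topK = 4): ThreadMessage[] {
  const earlier = messages.slice(0, -recent);
  const salient = [...earlier].sort((a, b) => b.salience - a.salience).slice(0, topK);
  const keep = new Set<ThreadMessage>([...salient, ...messages.slice(-recent)]);
  return messages.filter((m) => keep.has(m));
}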
RAG that doesn’t bloat
- Chunk size: 200–500 tokens with semantic titles.
- Top-K: start at 3; tune by evals.
- Filters: locale, product version, date; avoid dumping whole manuals.
- Citation budget: include only the passages you cite.
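One way to shape retrieval along these lines, assuming your retriever already returns scored chunks with token counts and metadata (the names here are illustrative):

interface ScoredChunk {
  text: string;
  title: string;       // semantic title for the chunk
  score: number;       // retrieval score
  tokens: number;      // chunk length in tokens (aim for 200–500)
  meta: { locale: string; productVersion: string };
}

// Filter by metadata first, then keep a small top-K instead of dumping whole manuals.
function selectContext(
  chunks: ScoredChunk[],
  opts: { locale: string; productVersion: string; topK?: number }
): ScoredChunk[] {
  const topK = opts.topK ?? 3;
  return chunks
    .filter((c) => c.meta.locale === opts.locale && c.meta.productVersion === opts.productVersion)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}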
Output control
- Set max_tokens per task type; don't let the model ramble.
- Use function calling / structured outputs to avoid wordy prose (see the sketch below).
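A sketch of both levers, with illustrative per-task caps and a strict JSON parse instead of free-form prose:

// Illustrative per-task output caps.
const MAX_TOKENS: Record<string, number> = {
  classification: 16,
  extraction: 256,
  grounded_qa: 600,
};

// Require JSON with known keys instead of free-form prose; reject anything else.
function parseStrictJSON(raw: string, requiredKeys: string[]): Record<string, unknown> | null {
  try {
    const obj: unknown = JSON.parse(raw);
    if (typeof obj !== "object" || obj === null) return null;
    return requiredKeys.every((k) => k in obj) ? (obj as Record<string, unknown>) : null;
  } catch {
    return null;
  }
}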
3) Caching: your best friend
Cache Layer | What you store | Keying strategy | Notes |
---|---|---|---|
Exact prompt cache | Final prompt → completion | SHA-256 of normalized prompt | Great for deterministic tasks (classification, extraction) |
Fuzzy prompt cache | Similar prompts → canonical result | Embedding + cosine ≥ τ | For near-duplicate queries; cap drift with thresholds |
Embedding cache | Text → vector | Hash(text) | Reuse vectors across features & tenants |
Tool result cache | API/tool outputs | Input params | E.g., currency rates, product catalogs, doc fetch |
Summary cache | Canonical short form | Resource/version | Precompute doc summaries for RAG |
KV cache (runtime) | Model key/attention states | Model/runtime feature | Depends on provider/runtime; best for long-gen chats |
Normalization recipe (pseudocode)
// Helpers: collapse whitespace and sort keys so cosmetic differences don't change the hash.
const trimWhitespace = (s: string) => s.replace(/\s+/g, " ").trim();
const sortObjectKeys = (o: Record<string, unknown>) =>
  Object.fromEntries(Object.entries(o).sort(([a], [b]) => a.localeCompare(b)));

function normalizePrompt(p: Prompt): string {
  return JSON.stringify({
    role: p.role,
    sys: trimWhitespace(p.system).toLowerCase(),
    instr: trimWhitespace(p.instruction),
    // Sort keys so order changes don't bust the cache:
    inputs: sortObjectKeys(p.inputs),
    // Strip volatile fields (timestamps, request IDs); keep only what affects the answer:
    meta: { locale: p.meta.locale, version: p.meta.version },
  });
}
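Pairing normalizePrompt with an exact cache, sketched here with Node's crypto and an in-memory Map standing in for your real store (Redis, etc.):

import { createHash } from "node:crypto";

// In-memory stand-in for your real cache store.
const exactCache = new Map<string, { value: string; expiresAt: number }>();

const cacheKey = (p: Prompt) => createHash("sha256").update(normalizePrompt(p)).digest("hex");

function cacheGet(p: Prompt): string | undefined {
  const hit = exactCache.get(cacheKey(p));
  return hit && hit.expiresAt > Date.now() ? hit.value : undefined;
}

function cacheSet(p: Prompt, value: string, ttlMs = 24 * 60 * 60 * 1000): void {
  exactCache.set(cacheKey(p), { value, expiresAt: Date.now() + ttlMs });
}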
Fuzzy cache guardrails
- Threshold τ = 0.92–0.96 for embeddings.
- Validate domain constraints (dates, locale, product version).
- Store provenance so you can audit misses.
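A fuzzy-match sketch under those guardrails; embed() stands in for your embedding call, and the stored entries are assumed to carry locale/version for the constraint check:

// embed() is a stand-in for your embedding call (assumption).
declare function embed(text: string): Promise<number[]>;

interface FuzzyEntry {
  embedding: number[];
  locale: string;
  version: string;
  result: string;
}

const cosine = (a: number[], b: number[]): number => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

// Reuse a cached result only if similarity clears τ AND the domain constraints match.
async function fuzzyMatch(
  query: string,
  ctx: { locale: string; version: string },
  entries: FuzzyEntry[],
  tau = 0.94
): Promise<string | undefined> {
  const q = await embed(query);
  const best = entries
    .filter((e) => e.locale === ctx.locale && e.version === ctx.version)
    .map((e) => ({ entry: e, sim: cosine(q, e.embedding) }))
    .sort((a, b) => b.sim - a.sim)[0];
  return best && best.sim >= tau ? best.entry.result : undefined;
}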
4) Hybrid models: right-size the brain
Routing policy (example)
- Easy (classification, dedupe, extraction on known templates) → small model (fast/cheap).
- Medium (grounded Q&A with good KB) → mid-tier.
- Hard/novel (creative synthesis, missing context, low confidence) → top model.
- Fail-safes → human-in-the-loop or refusal.
Signals to route on
- Confidence (self-reported + retrieval score)
- Coverage (did RAG find enough?)
- Task complexity (schema known? number of fields?)
- Risk (billing/legal → escalate or refuse)
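One way to turn those signals into a route; the thresholds and tier names are placeholders to tune against your own evals:

type Tier = "small" | "mid" | "top" | "human";

interface RouteSignals {
  confidence: number;    // self-reported + retrieval score, 0..1
  coverage: number;      // did RAG find enough? 0..1
  schemaKnown: boolean;  // complexity proxy: known template / output schema
  risk: "low" | "high";  // billing/legal → escalate or refuse
}

// Placeholder thresholds; tune them against your evals.
function chooseTier(s: RouteSignals): Tier {
  if (s.risk === "high") return "human";                      // fail-safe
  if (s.schemaKnown && s.confidence >= 0.9) return "small";   // easy, templated work
  if (s.coverage >= 0.7 && s.confidence >= 0.6) return "mid"; // grounded Q&A on a good KB
  return "top";                                               // novel / low-confidence work
}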
Distill and specialize
- Distill frequent tasks into small finetunes or adapters.
- Keep a style/tone adapter per brand or locale.
- Periodically re-distill from transcripts + gold data.
5) Batch, schedule, and stream
- Bulk embeddings: 5–10× cheaper at high throughput; backoff & retry queues.
- Nightly jobs: long summaries, index rebuilds, document OCR.
- Streaming outputs: show tokens as they land; cut perceived latency and allow early stop UI.
- Speculative drafts: precompute likely continuations for common prompts (where infra supports).
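A bulk-embedding sketch with batching, retries, and exponential backoff; embedBatch stands in for your provider's batch endpoint, and the batch size/delays are placeholders:

// embedBatch() is a stand-in for your provider's batch endpoint (assumption).
declare function embedBatch(texts: string[]): Promise<number[][]>;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function embedAll(texts: string[], batchSize = 128, maxRetries = 5): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    for (let attempt = 0; ; attempt++) {
      try {
        vectors.push(...(await embedBatch(batch)));
        break;
      } catch (err) {
        if (attempt >= maxRetries) throw err;
        await sleep(2 ** attempt * 500); // exponential backoff on rate limits / transient errors
      }
    }
  }
  return vectors;
}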
6) Observability: cost + quality together
- Traces: per-step spans (RAG fetch, prompt build, call, post-process).
- Feature dashboards: tokens, cache hit rate, route decisions, latency, error types.
- Unit evals: weekly golden sets; auto-open issues for regressions.
- Drift: alert on context length growth, prompt bloat, or cache thrash.
- Budgets & alerts: per-tenant and per-feature caps with throttles.
Quality gates (example)
- Critical error rate < 1%
- Hallucination rate < 2% (grounded Q&A)
- JSON validity > 99.5%
- Citation overlap ≥ 0.8
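Gates like these are easy to enforce in CI or a nightly eval job; a sketch, with metric names as placeholders for whatever your eval harness emits:

interface EvalMetrics {
  criticalErrorRate: number; // fraction, e.g. 0.004 = 0.4%
  hallucinationRate: number;
  jsonValidity: number;
  citationOverlap: number;
}

const GATES: Record<string, (m: EvalMetrics) => boolean> = {
  criticalErrorRate: (m) => m.criticalErrorRate < 0.01,
  hallucinationRate: (m) => m.hallucinationRate < 0.02,
  jsonValidity: (m) => m.jsonValidity > 0.995,
  citationOverlap: (m) => m.citationOverlap >= 0.8,
};

// Block the rollout (or auto-open an issue) if anything comes back here.
function failedGates(m: EvalMetrics): string[] {
  return Object.entries(GATES)
    .filter(([, check]) => !check(m))
    .map(([name]) => name);
}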
7) Governance & safety
- Refusal rules for restricted topics/actions.
- PII redaction before logging/caching.
- Versioned prompts (immutable IDs) and rollbacks.
- Per-tenant isolation for caches if data sensitivity requires.
- Data retention windows for caches; forget on request (GDPR).
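A deliberately simple redaction pass to run before anything hits logs or caches; regexes like these only catch the obvious patterns, so treat it as a stand-in for a real PII detector:

// Ordered so card-like numbers are caught before the looser phone pattern.
const REDACTIONS: [RegExp, string][] = [
  [/\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b/g, "[CARD]"], // 16-digit card-like numbers
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]"],                // email addresses
  [/\+?\d[\d\s().-]{7,}\d/g, "[PHONE]"],                  // phone-like number runs
];

function redactPII(text: string): string {
  return REDACTIONS.reduce((acc, [pattern, label]) => acc.replace(pattern, label), text);
}

// Call redactPII() on prompts and completions before cache.set() or any log sink.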
8) Cost levers cheat sheet
Lever | Typical Save | Risk | How to ship |
---|---|---|---|
System prompt trim | 10–25% | Low | Shorten, reference policy via RAG |
RAG top-K tuning | 5–15% | Low | Smaller chunks + filters |
Output token caps | 5–20% | Medium | Enforce JSON schemas |
Exact cache | 20–60% on repeat tasks | Low | Normalize & hash |
Fuzzy cache | 10–40% | Medium | τ threshold + guardrails |
Routing (small→big) | 20–50% | Medium | Confidence + coverage signals |
Distillation | 10–30% | Medium | Rebuild quarterly |
Batch embeddings | 2–5× throughput | Low | Queue + backoff |
Night jobs | Latency relief | Low | Schedule non-urgent work |
9) Example: bringing it together (Node/TS-style pseudocode)
const req = buildPrompt(ctx);
const key = hash(normalizePrompt(req));

// Try the exact cache, then the fuzzy cache; cached entries remember the route that produced them.
let out = cache.get(key) ?? fuzzyCache.match(req);
let route = out?.route;

if (!out) {
  route = chooseModelRoute({
    coverage: rag.coverageScore,
    confidence: selfCheck(req),
    risk: riskTag(ctx),
  });
  out = await callLLM(route.model, req);
  cache.set(key, { ...out, route }, { ttl: route.ttl });
}

const validated = validateJSON(out, schema);
if (!validated.ok) return escalate(req, out);

return {
  value: validated.value,
  cost: estimateCost(out, route.model),
  tokens: { in: req.toks, out: out.toks },
  provenance: { passages: rag.passages, model: route.model },
};
10) Rollout plan (90 days)
Week 1–2: Instrument tokens & costs, add exact cache, trim prompts.
Week 3–4: RAG chunk/filters, output schemas, fuzzy cache v1.
Week 5–6: Routing policy + small model path; start batch embeddings.
Week 7–8: Distill frequent tasks; nightly jobs; dashboards.
Week 9–12: Tighten thresholds, add budgets, quarterly distill plan.
FAQ
How do we keep quality while cutting cost?
Measure it. Use confidence/coverage routing, JSON schemas, and grounded RAG. Promote only what passes evals.
Is fuzzy caching safe?
Yes, if you apply a high similarity threshold and validate domain constraints (date/locale/version).
When should we finetune?
When a task repeats at scale with stable schemas and style. Distill from your best outputs.
Talk to Silyze
Need help implementing this stack (observability, caches, routing, distillation) in your codebase? Email support@silyze.com and we’ll help you cut spend 30–70% while keeping quality green.