Aug 24, 2025

LLM Cost Engineering: Tokens, Caching, and Hybrid Models

A hands-on playbook from Silyze for cutting LLM spend 30–70% without hurting quality, by measuring tokens, caching aggressively, and routing to the right models.

Spend less, ship faster, without turning your product into Mad Libs. This is our field guide to measuring, reducing, and routing LLM usage so you can scale features and margins together.

  • Measure first. Cost = Σ(in_tokens × in_price + out_tokens × out_price). Instrument per request and per feature.
  • Trim tokens. Shorten system prompts, chunk RAG smaller, drop baggage (logs/IDs), use structured outputs.
  • Cache everything. Exact + fuzzy prompt caches, embedding reuse, pre-computed tool calls, deterministic summaries.
  • Route smartly. 70–90% of traffic can go to smaller/cheaper models; escalate only when needed.
  • Batch & schedule. Night jobs and bulk embeddings save 2–5× on throughput.
  • Guardrails & evals. Savings stick when quality is measured and enforced.

1) Instrumentation: know your tokens

Track these per request:

  • input_tokens / output_tokens
  • model, temperature, top_p, max_tokens
  • latency_ms, cache_hit (bool, and cache layer)
  • feature (business dimension), tenant, locale
  • quality scores (auto-eval or human)

Cost formula

cost = input_tokens * price_in + output_tokens * price_out

Put the formula in a shared sheet + a dashboard. If the team can’t see costs by feature, nobody owns them.
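
A minimal sketch of this instrumentation in TypeScript; the price table, model names, and log sink are placeholders, not real provider prices:

// Hypothetical per-1K-token prices in USD; substitute your provider's price sheet.
const PRICES: Record<string, { in: number; out: number }> = {
  "small-model": { in: 0.0002, out: 0.0006 },
  "large-model": { in: 0.003, out: 0.015 },
};

interface LlmCallRecord {
  feature: string;        // business dimension (e.g. "support-draft")
  tenant: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  cacheHit: boolean;
}

// cost = input_tokens * price_in + output_tokens * price_out (prices per 1K tokens here)
function requestCostUsd(r: LlmCallRecord): number {
  const p = PRICES[r.model];
  return (r.inputTokens / 1000) * p.in + (r.outputTokens / 1000) * p.out;
}

// One row per request so dashboards can slice cost by feature, tenant, and model.
function logCall(r: LlmCallRecord): void {
  console.log(JSON.stringify({ ...r, costUsd: requestCostUsd(r) }));
}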

2) Token hygiene: fewer in, fewer out

System prompt diet

  • Replace paragraphs with bullets; pull policy snippets via RAG at runtime.
  • Avoid repeating boilerplate across calls; centralize it and reference it.
  • Declare output schema (JSON/YAML) to cap verbose answers.

User/context trimming

  • Drop IDs, timestamps, or logs not used by the model.
  • Clip long threads using recency + salience (keep the last N + top K); see the sketch after this list.
  • Summarize earlier steps into compact state objects.
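
A minimal sketch of the clipping step above, assuming you supply a scoreSalience helper (e.g. keyword or embedding relevance to the current task):

interface Turn { role: "user" | "assistant"; text: string }

// Keep the last `recentN` turns verbatim plus the `topK` most salient earlier ones,
// preserving original order; everything else is dropped (or summarized separately).
function clipThread(
  turns: Turn[],
  recentN: number,
  topK: number,
  scoreSalience: (t: Turn) => number,
): Turn[] {
  const recent = turns.slice(-recentN);
  const earlier = turns.slice(0, Math.max(0, turns.length - recentN));
  const salient = earlier
    .map((t, i) => ({ t, i, score: scoreSalience(t) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .sort((a, b) => a.i - b.i)
    .map((x) => x.t);
  return [...salient, ...recent];
}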

RAG that doesn’t bloat

  • Chunk size: 200–500 tokens with semantic titles.
  • Top-K: start at 3; tune by evals.
  • Filters: locale, product version, date; avoid dumping whole manuals.
  • Citation budget: include only the passages you cite.

Output control

  • max_tokens per task type; don’t let the model ramble.
  • Use function calling / structured outputs to avoid wordy prose.
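
For illustration, a sketch of both controls at the call site; generateStructured is a hypothetical wrapper around your provider's structured-output / function-calling API, and the budgets are made-up numbers to tune from your own p95 output lengths:

// Hypothetical wrapper; maxTokens and responseSchema map to whatever your provider exposes.
declare function generateStructured(opts: {
  model: string;
  maxTokens: number;
  responseSchema: object;
  prompt: string;
}): Promise<string>;

// Per-task output budgets.
const OUTPUT_BUDGETS: Record<string, number> = { classify: 16, extract: 256, summarize: 400 };

// The schema caps verbosity: the model fills exactly these fields and nothing else.
const extractionSchema = {
  type: "object",
  properties: {
    invoiceId: { type: "string" },
    totalCents: { type: "integer" },
    currency: { type: "string" },
  },
  required: ["invoiceId", "totalCents", "currency"],
  additionalProperties: false,
};

const raw = await generateStructured({
  model: "small-model",
  maxTokens: OUTPUT_BUDGETS.extract,
  responseSchema: extractionSchema,
  prompt: "Extract the invoice fields from the attached message.",
});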

3) Caching: your best friend

Cache layer | What you store | Keying strategy | Notes
Exact prompt cache | Final prompt → completion | SHA-256 of normalized prompt | Great for deterministic tasks (classification, extraction)
Fuzzy prompt cache | Similar prompts → canonical result | Embedding + cosine ≥ τ | For near-duplicate queries; cap drift with thresholds
Embedding cache | Text → vector | Hash(text) | Reuse vectors across features & tenants
Tool result cache | API/tool outputs | Input params | E.g., currency rates, product catalogs, doc fetch
Summary cache | Canonical short form | Resource/version | Precompute doc summaries for RAG
KV cache (runtime) | Model key/attention states | Model/runtime feature | Depends on provider/runtime; best for long-gen chats

Normalization recipe (TypeScript sketch)

interface Prompt {
  role: string;
  system: string;
  instruction: string;
  inputs: Record<string, unknown>;
  meta: { locale: string; version: string };
}
const trimWhitespace = (s: string) => s.replace(/\s+/g, " ").trim();
// Sort keys so order changes don't bust the cache:
const sortObjectKeys = (o: Record<string, unknown>) =>
  Object.fromEntries(Object.entries(o).sort(([a], [b]) => a.localeCompare(b)));

function normalizePrompt(p: Prompt): string {
  return JSON.stringify({
    role: p.role,
    sys: trimWhitespace(p.system).toLowerCase(),
    instr: trimWhitespace(p.instruction),
    inputs: sortObjectKeys(p.inputs),
    // Strip volatile fields (timestamps, request IDs):
    meta: { locale: p.meta.locale, version: p.meta.version },
  });
}
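
For the exact-cache key, one concrete option is Node's built-in crypto module (prompt here is a Prompt value as above):

import { createHash } from "node:crypto";

// SHA-256 over the normalized prompt gives a stable exact-cache key.
const cacheKey = createHash("sha256").update(normalizePrompt(prompt)).digest("hex");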

Fuzzy cache guardrails

  • Threshold τ = 0.92–0.96 for embeddings.
  • Validate domain constraints (dates, locale, product version).
  • Store provenance so you can audit misses.
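
A minimal sketch of a fuzzy lookup under those guardrails, assuming you already have the query embedding and stored entries carrying their embeddings plus domain metadata:

interface FuzzyEntry {
  embedding: number[];
  locale: string;
  version: string;
  result: string;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return a cached result only if similarity clears τ AND the domain constraints match.
function fuzzyMatch(
  queryEmbedding: number[],
  meta: { locale: string; version: string },
  entries: FuzzyEntry[],
  tau = 0.94,
): string | null {
  let best: { entry: FuzzyEntry; sim: number } | null = null;
  for (const entry of entries) {
    if (entry.locale !== meta.locale || entry.version !== meta.version) continue;
    const sim = cosine(queryEmbedding, entry.embedding);
    if (sim >= tau && (!best || sim > best.sim)) best = { entry, sim };
  }
  return best ? best.entry.result : null;
}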

4) Hybrid models: right-size the brain

Routing policy (example)

  1. Easy (classification, dedupe, extraction on known templates) → small model (fast/cheap).
  2. Medium (grounded Q&A with good KB) → mid-tier.
  3. Hard/novel (creative synthesis, missing context, low confidence) → top model.
  4. Fail-safes → human-in-the-loop or refusal.

Signals to route on

  • Confidence (self-reported + retrieval score)
  • Coverage (did RAG find enough?)
  • Task complexity (schema known? number of fields?)
  • Risk (billing/legal → escalate or refuse)
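
A minimal sketch of a router over those signals; the thresholds and model names are illustrative, to be calibrated against your evals:

type Route = { model: string; ttl: number }; // ttl in seconds for the result cache

interface RouteSignals {
  coverage: number;                 // 0–1: did RAG find enough?
  confidence: number;               // 0–1: self-reported + retrieval score
  risk: "low" | "medium" | "high";  // billing/legal → high
}

function chooseModelRoute(s: RouteSignals): Route {
  // High-risk requests skip the cheap path entirely (or get refused upstream).
  if (s.risk === "high") return { model: "top-model", ttl: 0 };
  // Easy, well-covered, high-confidence work goes to the small model.
  if (s.coverage >= 0.8 && s.confidence >= 0.85) return { model: "small-model", ttl: 86_400 };
  // Grounded but less certain: mid-tier.
  if (s.coverage >= 0.5) return { model: "mid-model", ttl: 3_600 };
  // Missing context or novel task: escalate to the top model.
  return { model: "top-model", ttl: 600 };
}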

Distill and specialize

  • Distill frequent tasks into small finetunes or adapters.
  • Keep a style/tone adapter per brand or locale.
  • Periodically re-distill from transcripts + gold data.

5) Batch, schedule, and stream

  • Bulk embeddings: 5–10× cheaper at high throughput; backoff & retry queues (see the sketch after this list).
  • Nightly jobs: long summaries, index rebuilds, document OCR.
  • Streaming outputs: show tokens as they land; cut perceived latency and allow early stop UI.
  • Speculative drafts: precompute likely continuations for common prompts (where infra supports).
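
A minimal sketch of a batched embedding job with exponential backoff; embedBatch is a placeholder for your provider's bulk embedding endpoint:

// Placeholder for the provider's bulk embedding call.
declare function embedBatch(texts: string[]): Promise<number[][]>;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function embedAll(texts: string[], batchSize = 256, maxRetries = 5): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    for (let attempt = 0; ; attempt++) {
      try {
        vectors.push(...(await embedBatch(batch)));
        break;
      } catch (err) {
        if (attempt >= maxRetries) throw err;
        // Exponential backoff with jitter before retrying the failed batch.
        await sleep(2 ** attempt * 500 + Math.random() * 250);
      }
    }
  }
  return vectors;
}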

6) Observability: cost + quality together

  • Traces: per-step spans (RAG fetch, prompt build, call, post-process).
  • Feature dashboards: tokens, cache hit rate, route decisions, latency, error types.
  • Unit evals: weekly golden sets; auto-open issues for regressions.
  • Drift: alert on context length growth, prompt bloat, or cache thrash.
  • Budgets & alerts: per-tenant and per-feature caps with throttles.

Quality gates (example)

- Critical error rate < 1%
- Hallucination rate < 2% (grounded Q&A)
- JSON validity > 99.5%
- Citation overlap ≥ 0.8
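
One way to enforce those gates automatically, assuming your weekly eval run produces a metrics object like the one below (the field names are placeholders):

interface EvalMetrics {
  criticalErrorRate: number;  // fraction, e.g. 0.004
  hallucinationRate: number;  // grounded Q&A only
  jsonValidity: number;       // fraction of structurally valid outputs
  citationOverlap: number;    // 0–1 overlap with gold citations
}

// Mirrors the gates above; fail the pipeline (or auto-open an issue) on any regression.
function passesQualityGates(m: EvalMetrics): boolean {
  return (
    m.criticalErrorRate < 0.01 &&
    m.hallucinationRate < 0.02 &&
    m.jsonValidity > 0.995 &&
    m.citationOverlap >= 0.8
  );
}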

7) Governance & safety

  • Refusal rules for restricted topics/actions.
  • PII redaction before logging/caching.
  • Versioned prompts (immutable IDs) and rollbacks.
  • Per-tenant isolation for caches if data sensitivity requires.
  • Data retention windows for caches; forget on request (GDPR).

8) Cost levers cheat sheet

Lever | Typical save | Risk | How to ship
System prompt trim | 10–25% | Low | Shorten, reference policy via RAG
RAG top-K tuning | 5–15% | Low | Smaller chunks + filters
Output token caps | 5–20% | Medium | Enforce JSON schemas
Exact cache | 20–60% on repeat tasks | Low | Normalize & hash
Fuzzy cache | 10–40% | Medium | τ threshold + guardrails
Routing (small→big) | 20–50% | Medium | Confidence + coverage signals
Distillation | 10–30% | Medium | Rebuild quarterly
Batch embeddings | 2–5× throughput | Low | Queue + backoff
Night jobs | Latency relief | Low | Schedule non-urgent work

9) Example: bringing it together (Node/TS-style pseudocode)

const req = buildPrompt(ctx);
const key = hash(normalizePrompt(req));

// Exact cache first, then fuzzy match for near-duplicate prompts.
let out = cache.get(key) ?? fuzzyCache.match(req);
// Cached entries remember which route produced them, so cost/provenance resolve on hits too.
let route = out?.route;

if (!out) {
  route = chooseModelRoute({
    coverage: rag.coverageScore,
    confidence: selfCheck(req),
    risk: riskTag(ctx),
  });
  out = await callLLM(route.model, req);
  cache.set(key, { ...out, route }, { ttl: route.ttl });
}

const validated = validateJSON(out, schema);
if (!validated.ok) return escalate(req, out);

return {
  value: validated.value,
  cost: estimateCost(out, route.model),
  tokens: { in: req.toks, out: out.toks },
  provenance: { passages: rag.passages, model: route.model },
};

10) Rollout plan (90 days)

Week 1–2: Instrument tokens & costs, add exact cache, trim prompts.
Week 3–4: RAG chunk/filters, output schemas, fuzzy cache v1.
Week 5–6: Routing policy + small model path; start batch embeddings.
Week 7–8: Distill frequent tasks; nightly jobs; dashboards.
Week 9–12: Tighten thresholds, add budgets, quarterly distill plan.

FAQ

How do we keep quality while cutting cost?
Measure it. Use confidence/coverage routing, JSON schemas, and grounded RAG. Promote only what passes evals.

Is fuzzy caching safe?
Yes, if you apply a high similarity threshold and validate domain constraints (date/locale/version).

When should we finetune?
When a task repeats at scale with stable schemas and style. Distill from your best outputs.

Talk to Silyze

Need help implementing this stack (observability, caches, routing, distillation) in your codebase? Write to support@silyze.com and we’ll help you cut spend 30–70% while keeping quality green.
