Spend less, ship faster, without turning your product into Mad Libs. This is our field guide to measuring, reducing, and routing LLM usage so you can scale features and margins together.
- Measure first. Cost = Σ(in_tokens × in_price + out_tokens × out_price). Instrument per request and per feature.
- Trim tokens. Shorten system prompts, chunk RAG smaller, drop baggage (logs/IDs), use structured outputs.
- Cache everything. Exact + fuzzy prompt caches, embedding reuse, pre-computed tool calls, deterministic summaries.
- Route smartly. 70–90% of traffic can go to smaller/cheaper models; escalate only when needed.
- Batch & schedule. Night jobs and bulk embeddings save 2–5× on throughput.
- Guardrails & evals. Savings stick when quality is measured and enforced.
1) Instrumentation: know your tokens
Track these per request:
- input_tokens / output_tokens
- model, temperature, top_p, max_tokens
- latency_ms, cache_hit (bool, and cache layer)
- feature (business dimension), tenant, locale
- quality scores (auto-eval or human)
Cost formula
cost = input_tokens * price_in + output_tokens * price_out
Put the formula in a shared sheet + a dashboard. If the team can’t see costs by feature, nobody owns them.
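A minimal sketch of that per-request record in TypeScript, with made-up model names and prices standing in for your provider's real rates:

// Hypothetical per-token prices (USD); swap in your provider's real rates.
const PRICES: Record<string, { in: number; out: number }> = {
  "small-model": { in: 0.15e-6, out: 0.6e-6 },
  "large-model": { in: 2.5e-6, out: 10e-6 },
};

interface RequestLog {
  feature: string;
  tenant: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  cacheHit: boolean;
  costUsd: number;
}

// cost = input_tokens * price_in + output_tokens * price_out
function costUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  return inputTokens * price.in + outputTokens * price.out;
}

function recordRequest(log: Omit<RequestLog, "costUsd">): RequestLog {
  const entry = { ...log, costUsd: costUsd(log.model, log.inputTokens, log.outputTokens) };
  console.log(JSON.stringify(entry)); // replace with your metrics pipeline
  return entry;
}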
2) Token hygiene: fewer in, fewer out
System prompt diet
- Replace paragraphs with bullets; pull policy snippets via RAG at runtime.
- Avoid repeating boilerplate across calls; centralize it and reference it.
- Declare output schema (JSON/YAML) to cap verbose answers.
User/context trimming
- Drop IDs, timestamps, or logs not used by the model.
- Clip long threads using recency + salience (keep the last N + top K).
- Summarize earlier steps into compact state objects.
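A sketch of the recency + salience clip, assuming you attach your own salience score (e.g., embedding similarity to the current query) to each message:

// salience: your own score per message, e.g. similarity to the current query.
interface ThreadMessage { role: string; content: string; salience: number }

// Keep the last `recent` messages plus the `topK` most salient earlier ones, in original order.
function clipThread(messages: ThreadMessage[], recent = 6, topK = 4): ThreadMessage[] {
  const earlier = messages.slice(0, -recent);
  const salient = [...earlier].sort((a, b) => b.salience - a.salience).slice(0, topK);
  const keep = new Set<ThreadMessage>([...salient, ...messages.slice(-recent)]);
  return messages.filter((m) => keep.has(m));
}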
RAG that doesn’t bloat
- Chunk size: 200–500 tokens with semantic titles.
- Top-K: start at 3; tune by evals.
- Filters: locale, product version, date; avoid dumping whole manuals.
- Citation budget: include only the passages you cite.
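One way to shape retrieval along these lines, assuming your retriever already returns scored chunks with token counts and metadata (the names here are illustrative):

interface ScoredChunk {
  text: string;
  title: string;       // semantic title for the chunk
  score: number;       // retrieval score
  tokens: number;      // chunk length in tokens (aim for 200–500)
  meta: { locale: string; productVersion: string };
}

// Filter by metadata first, then keep a small top-K instead of dumping whole manuals.
function selectContext(
  chunks: ScoredChunk[],
  opts: { locale: string; productVersion: string; topK?: number }
): ScoredChunk[] {
  const topK = opts.topK ?? 3;
  return chunks
    .filter((c) => c.meta.locale === opts.locale && c.meta.productVersion === opts.productVersion)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}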
Output control
- Set max_tokens per task type; don't let the model ramble.
- Use function calling / structured outputs to avoid wordy prose (see the sketch below).
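A sketch of both levers, with illustrative per-task caps and a strict JSON parse instead of free-form prose:

// Illustrative per-task output caps.
const MAX_TOKENS: Record<string, number> = {
  classification: 16,
  extraction: 256,
  grounded_qa: 600,
};

// Require JSON with known keys instead of free-form prose; reject anything else.
function parseStrictJSON(raw: string, requiredKeys: string[]): Record<string, unknown> | null {
  try {
    const obj: unknown = JSON.parse(raw);
    if (typeof obj !== "object" || obj === null) return null;
    return requiredKeys.every((k) => k in obj) ? (obj as Record<string, unknown>) : null;
  } catch {
    return null;
  }
}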
3) Caching: your best friend
Cache Layer | What you store | Keying strategy | Notes |
---|---|---|---|
Exact prompt cache | Final prompt → completion | SHA-256 of normalized prompt | Great for deterministic tasks (classification, extraction) |
Fuzzy prompt cache | Similar prompts → canonical result | Embedding + cosine ≥ τ | For near-duplicate queries; cap drift with thresholds |
Embedding cache | Text → vector | Hash(text) | Reuse vectors across features & tenants |
Tool result cache | API/tool outputs | Input params | E.g., currency rates, product catalogs, doc fetch |
Summary cache | Canonical short form | Resource/version | Precompute doc summaries for RAG |
KV cache (runtime) | Model key/attention states | Model/runtime feature | Depends on provider/runtime; best for long-gen chats |
Normalization recipe (pseudocode)
// Helpers: collapse whitespace and sort keys so cosmetic differences don't change the hash.
const trimWhitespace = (s: string) => s.replace(/\s+/g, " ").trim();
const sortObjectKeys = (o: Record<string, unknown>) =>
  Object.fromEntries(Object.entries(o).sort(([a], [b]) => a.localeCompare(b)));

function normalizePrompt(p: Prompt): string {
  return JSON.stringify({
    role: p.role,
    sys: trimWhitespace(p.system).toLowerCase(),
    instr: trimWhitespace(p.instruction),
    // Sort keys so order changes don't bust the cache:
    inputs: sortObjectKeys(p.inputs),
    // Strip volatile fields (timestamps, request IDs); keep only what affects the answer:
    meta: { locale: p.meta.locale, version: p.meta.version },
  });
}
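Pairing normalizePrompt with an exact cache, sketched here with Node's crypto and an in-memory Map standing in for your real store (Redis, etc.):

import { createHash } from "node:crypto";

// In-memory stand-in for your real cache store.
const exactCache = new Map<string, { value: string; expiresAt: number }>();

const cacheKey = (p: Prompt) => createHash("sha256").update(normalizePrompt(p)).digest("hex");

function cacheGet(p: Prompt): string | undefined {
  const hit = exactCache.get(cacheKey(p));
  return hit && hit.expiresAt > Date.now() ? hit.value : undefined;
}

function cacheSet(p: Prompt, value: string, ttlMs = 24 * 60 * 60 * 1000): void {
  exactCache.set(cacheKey(p), { value, expiresAt: Date.now() + ttlMs });
}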
Fuzzy cache guardrails
- Threshold τ = 0.92–0.96 for embeddings.
- Validate domain constraints (dates, locale, product version).
- Store provenance so you can audit misses.
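A fuzzy-match sketch under those guardrails; embed() stands in for your embedding call, and the stored entries are assumed to carry locale/version for the constraint check:

// embed() is a stand-in for your embedding call (assumption).
declare function embed(text: string): Promise<number[]>;

interface FuzzyEntry {
  embedding: number[];
  locale: string;
  version: string;
  result: string;
}

const cosine = (a: number[], b: number[]): number => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

// Reuse a cached result only if similarity clears τ AND the domain constraints match.
async function fuzzyMatch(
  query: string,
  ctx: { locale: string; version: string },
  entries: FuzzyEntry[],
  tau = 0.94
): Promise<string | undefined> {
  const q = await embed(query);
  const best = entries
    .filter((e) => e.locale === ctx.locale && e.version === ctx.version)
    .map((e) => ({ entry: e, sim: cosine(q, e.embedding) }))
    .sort((a, b) => b.sim - a.sim)[0];
  return best && best.sim >= tau ? best.entry.result : undefined;
}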
4) Hybrid models: right-size the brain
Routing policy (example)
- Easy (classification, dedupe, extraction on known templates) → small model (fast/cheap).
- Medium (grounded Q&A with good KB) → mid-tier.
- Hard/novel (creative synthesis, missing context, low confidence) → top model.
- Fail-safes → human-in-the-loop or refusal.
Signals to route on
- Confidence (self-reported + retrieval score)
- Coverage (did RAG find enough?)
- Task complexity (schema known? number of fields?)
- Risk (billing/legal → escalate or refuse)
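One way to turn those signals into a route; the thresholds and tier names are placeholders to tune against your own evals:

type Tier = "small" | "mid" | "top" | "human";

interface RouteSignals {
  confidence: number;    // self-reported + retrieval score, 0..1
  coverage: number;      // did RAG find enough? 0..1
  schemaKnown: boolean;  // complexity proxy: known template / output schema
  risk: "low" | "high";  // billing/legal → escalate or refuse
}

// Placeholder thresholds; tune them against your evals.
function chooseTier(s: RouteSignals): Tier {
  if (s.risk === "high") return "human";                      // fail-safe
  if (s.schemaKnown && s.confidence >= 0.9) return "small";   // easy, templated work
  if (s.coverage >= 0.7 && s.confidence >= 0.6) return "mid"; // grounded Q&A on a good KB
  return "top";                                               // novel / low-confidence work
}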
Distill and specialize
- Distill frequent tasks into small finetunes or adapters.
- Keep a style/tone adapter per brand or locale.
- Periodically re-distill from transcripts + gold data.
5) Batch, schedule, and stream
- Bulk embeddings: 5–10× cheaper at high throughput; backoff & retry queues.
- Nightly jobs: long summaries, index rebuilds, document OCR.
- Streaming outputs: show tokens as they land; cut perceived latency and allow early stop UI.
- Speculative drafts: precompute likely continuations for common prompts (where infra supports).
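A bulk-embedding sketch with batching, retries, and exponential backoff; embedBatch stands in for your provider's batch endpoint, and the batch size/delays are placeholders:

// embedBatch() is a stand-in for your provider's batch endpoint (assumption).
declare function embedBatch(texts: string[]): Promise<number[][]>;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function embedAll(texts: string[], batchSize = 128, maxRetries = 5): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    for (let attempt = 0; ; attempt++) {
      try {
        vectors.push(...(await embedBatch(batch)));
        break;
      } catch (err) {
        if (attempt >= maxRetries) throw err;
        await sleep(2 ** attempt * 500); // exponential backoff on rate limits / transient errors
      }
    }
  }
  return vectors;
}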
6) Observability: cost + quality together
- Traces: per-step spans (RAG fetch, prompt build, call, post-process).
- Feature dashboards: tokens, cache hit rate, route decisions, latency, error types.
- Unit evals: weekly golden sets; auto-open issues for regressions.
- Drift: alert on context length growth, prompt bloat, or cache thrash.
- Budgets & alerts: per-tenant and per-feature caps with throttles.
Quality gates (example)
- Critical error rate < 1%
- Hallucination rate < 2% (grounded Q&A)
- JSON validity > 99.5%
- Citation overlap ≥ 0.8
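Gates like these are easy to enforce in CI or a nightly eval job; a sketch, with metric names as placeholders for whatever your eval harness emits:

interface EvalMetrics {
  criticalErrorRate: number; // fraction, e.g. 0.004 = 0.4%
  hallucinationRate: number;
  jsonValidity: number;
  citationOverlap: number;
}

const GATES: Record<string, (m: EvalMetrics) => boolean> = {
  criticalErrorRate: (m) => m.criticalErrorRate < 0.01,
  hallucinationRate: (m) => m.hallucinationRate < 0.02,
  jsonValidity: (m) => m.jsonValidity > 0.995,
  citationOverlap: (m) => m.citationOverlap >= 0.8,
};

// Block the rollout (or auto-open an issue) if anything comes back here.
function failedGates(m: EvalMetrics): string[] {
  return Object.entries(GATES)
    .filter(([, check]) => !check(m))
    .map(([name]) => name);
}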
7) Governance & safety
- Refusal rules for restricted topics/actions.
- PII redaction before logging/caching.
- Versioned prompts (immutable IDs) and rollbacks.
- Per-tenant isolation for caches if data sensitivity requires.
- Data retention windows for caches; forget on request (GDPR).
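A deliberately simple redaction pass to run before anything hits logs or caches; regexes like these only catch the obvious patterns, so treat it as a stand-in for a real PII detector:

// Ordered so card-like numbers are caught before the looser phone pattern.
const REDACTIONS: [RegExp, string][] = [
  [/\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b/g, "[CARD]"], // 16-digit card-like numbers
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]"],                // email addresses
  [/\+?\d[\d\s().-]{7,}\d/g, "[PHONE]"],                  // phone-like number runs
];

function redactPII(text: string): string {
  return REDACTIONS.reduce((acc, [pattern, label]) => acc.replace(pattern, label), text);
}

// Call redactPII() on prompts and completions before cache.set() or any log sink.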
8) Cost levers cheat sheet
Lever | Typical Save | Risk | How to ship |
---|---|---|---|
System prompt trim | 10–25% | Low | Shorten, reference policy via RAG |
RAG top-K tuning | 5–15% | Low | Smaller chunks + filters |
Output token caps | 5–20% | Medium | Enforce JSON schemas |
Exact cache | 20–60% on repeat tasks | Low | Normalize & hash |
Fuzzy cache | 10–40% | Medium | τ threshold + guardrails |
Routing (small→big) | 20–50% | Medium | Confidence + coverage signals |
Distillation | 10–30% | Medium | Rebuild quarterly |
Batch embeddings | 2–5× throughput | Low | Queue + backoff |
Night jobs | Latency relief | Low | Schedule non-urgent work |
9) Example: bringing it together (Node/TS-style pseudocode)
const req = buildPrompt(ctx);
const key = hash(normalizePrompt(req));

// Try the exact cache, then the fuzzy cache; cached entries remember the route that produced them.
let out = cache.get(key) ?? fuzzyCache.match(req);
let route = out?.route;

if (!out) {
  route = chooseModelRoute({
    coverage: rag.coverageScore,
    confidence: selfCheck(req),
    risk: riskTag(ctx),
  });
  out = await callLLM(route.model, req);
  cache.set(key, { ...out, route }, { ttl: route.ttl });
}

const validated = validateJSON(out, schema);
if (!validated.ok) return escalate(req, out);

return {
  value: validated.value,
  cost: estimateCost(out, route.model),
  tokens: { in: req.toks, out: out.toks },
  provenance: { passages: rag.passages, model: route.model },
};
10) Rollout plan (90 days)
Week 1–2: Instrument tokens & costs, add exact cache, trim prompts.
Week 3–4: RAG chunk/filters, output schemas, fuzzy cache v1.
Week 5–6: Routing policy + small model path; start batch embeddings.
Week 7–8: Distill frequent tasks; nightly jobs; dashboards.
Week 9–12: Tighten thresholds, add budgets, quarterly distill plan.
FAQ
How do we keep quality while cutting cost?
Measure it. Use confidence/coverage routing, JSON schemas, and grounded RAG. Promote only what passes evals.
Is fuzzy caching safe?
Yes, if you apply a high similarity threshold and validate domain constraints (date/locale/version).
When should we finetune?
When a task repeats at scale with stable schemas and style. Distill from your best outputs.
Talk to Silyze
Need help implementing this stack (observability, caches, routing, distillation) in your codebase? Email support@silyze.com and we’ll help you cut spend 30–70% while keeping quality green.