AI Engineering · Inference · Serving

LLM Inference Optimization: The Engine Beneath the Agentic System

The agentic system post covered the gateway, orchestrator, memory, and agents. All of that holds. But what happens when the research agent actually calls the LLM? Inference. The prompt goes in, tokens are generated, and that is where latency and cost are born.

This post is about that layer: how to optimize inference in production, which metrics to track, and how it fits with the agentic system design post.

Asif Bin Syed · ML Researcher · OMSCS @ Georgia Tech · May 2026

Part 1: Why inference is a separate problem · Part 2: Techniques and stack · Deep dives: Metrics · Batching · KV cache · Quantization · Serving · Agentic link

Part 1: Why inference is a separate problem

Even with the agentic architecture right, the system can still feel slow to the user, because every agent step needs an LLM call. 8 steps means 8 inference calls. However smart the orchestrator is, a slow serving layer makes the whole system slow.

Agentic design answers: how many LLM calls happen. Inference optimization answers: how fast and how cheap each call is.

Real symptom

On the dashboard you see:

P99 end-to-end latency: 45 seconds
Gateway span: 200ms
Orchestrator span: 3 seconds
LLM inference span: 38 seconds

The problem is not the orchestrator. The problem is inference.
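A quick way to make that reading explicit is to turn the spans into shares of the total. This is a minimal sketch that treats the dashboard numbers above as one slow trace (P99 values from different spans do not strictly add up); the function name `span_shares` is hypothetical.

```python
# Span durations from the dashboard example, in seconds.
spans = {"gateway": 0.2, "orchestrator": 3.0, "llm_inference": 38.0}
end_to_end = 45.0


def span_shares(spans: dict[str, float], total: float) -> dict[str, float]:
    """Return each span's share of total latency as a percentage."""
    shares = {name: round(100 * sec / total, 1) for name, sec in spans.items()}
    # Time the named spans do not cover (network, queueing, tools, ...).
    shares["unattributed"] = round(100 * (total - sum(spans.values())) / total, 1)
    return shares


shares = span_shares(spans, end_to_end)
worst = max(spans, key=spans.get)  # span with the largest duration
print(worst, shares)
```

Here inference accounts for roughly 84% of the trace, so tuning the orchestrator first would touch less than 7% of the latency.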

Inference stack

Policy (gateway) sits on top. Physics (GPU, memory bandwidth) sits at the bottom. The serving engine is in between, and that is where most of the optimization happens.

Three bottlenecks

Bottleneck	Symptom	Typical fix
Memory	OOM, short max context	Quantization, KV paging, smaller model
Compute	Low tokens/s, GPU 100%	Batching, FlashAttention, better GPU
Scheduling	Queue buildup, tail latency	Continuous batching, priority queues

Interview tip: “We optimized agents at the workflow level, then at the serving level with batching and KV cache, then at the model level with quantization.”

Part 2: Techniques and stack

Let's map the techniques: which one solves what.

Technique	Primarily fixes	Trade-off
Continuous batching	GPU utilization, throughput	Tail latency variance
KV cache / PagedAttention	Memory per request	Implementation complexity
Prefix caching	Repeated system prompts	Cache invalidation
Quantization (INT8/FP8)	Memory, sometimes speed	Small quality drop
Speculative decoding	Latency per token	Draft model maintenance
Smaller / routed models	Cost, latency	Capability ceiling
Shorter context	Memory + prefill time	Less information

None of these is a silver bullet. In production you stack them.

Inference metrics (deep dive)

What to measure before you optimize

Measure before you optimize. Otherwise you are tuning blind.

Core metrics

TTFT (Time To First Token)
How long after the user sends the prompt the first token arrives. A proxy for the prefill phase. A long system prompt raises TTFT.

Tokens per second (decode throughput)
How many tokens per second the generation phase produces. This is what the user perceives as "typing speed".

End-to-end latency
How long until the full response completes. Approximately TTFT + (output_tokens / tokens_per_sec).

Cost per 1M tokens
Match this against gateway billing. Input and output prices differ (output is usually more expensive).

GPU utilization
If the GPU sits idle, increase batching. If the GPU hits OOM, quantize or cut context.
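The latency approximation and the per-token pricing above can be checked with a few lines of arithmetic. This is a sketch only: the function names and the example prices are hypothetical, not taken from any provider's price list.

```python
def estimate_latency_s(ttft_s: float, output_tokens: int, tokens_per_sec: float) -> float:
    """End-to-end latency ~= TTFT + decode time."""
    return ttft_s + output_tokens / tokens_per_sec


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,   # USD per 1M input tokens (assumed)
    output_price_per_m: float,  # USD per 1M output tokens (assumed)
) -> float:
    """Cost of one request when input and output are priced separately."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# 0.8 s to first token, then 500 tokens at 50 tokens/s.
latency = estimate_latency_s(ttft_s=0.8, output_tokens=500, tokens_per_sec=50.0)
# 4000 prompt tokens and 500 output tokens at made-up prices.
cost = request_cost_usd(4000, 500, input_price_per_m=0.50, output_price_per_m=1.50)
print(f"{latency:.1f} s, ${cost:.5f}")
```

In this example decode dominates latency (10 s of 10.8 s) while input dominates cost, which is why the two are worth tracking separately.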

What to log (minimal)

# inference/metrics.py
import time
from dataclasses import dataclass


@dataclass
class InferenceRecord:
    model: str
    input_tokens: int = 0
    output_tokens: int = 0
    ttft_ms: float = 0.0
    total_ms: float = 0.0
    tokens_per_sec: float = 0.0
    gpu: str = ""
    quantized: bool = False
    agent: str = ""
    run_id: str = ""


def record_from_stream(
    model: str,
    input_tokens: int,
    first_token_at: float,
    start: float,
    output_tokens: int,
    agent: str = "",
) -> InferenceRecord:
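    # Read the clock once so total_ms and decode time share the same end point.
    now = time.time()
    total_ms = (now - start) * 1000
    ttft_ms = (first_token_at - start) * 1000
    # Clamp to 1 ms so a near-instant decode cannot divide by zero.
    decode_sec = max(now - first_token_at, 0.001)
    tps = output_tokens / decode_sec if output_tokens else 0.0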
    return InferenceRecord(
        model=model,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        ttft_ms=ttft_ms,
        total_ms=total_ms,
        tokens_per_sec=round(tps, 2),
        agent=agent,
    )

The observability setup in the agentic post has an llm.request.duration histogram. For inference, keep TTFT in a separate histogram. That lets you split prefill from decode inside the same slow request.

Common mistake: looking only at total latency. When TTFT is bad the user thinks the "model is stuck", even if decode is fast.
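To capture TTFT separately, the timing has to happen while the stream is being consumed. The sketch below assumes the response arrives as any Python iterable of tokens; `measure_stream` is a hypothetical helper, and the injectable `clock` exists only so the timing can be tested deterministically.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and return TTFT, total time, and decode speed."""
    start = clock()
    first_token_at: Optional[float] = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = clock()  # prefill ends when the first token arrives
        count += 1
    end = clock()
    if first_token_at is None:  # empty stream: no TTFT to report
        return {"ttft_ms": None, "total_ms": (end - start) * 1000,
                "output_tokens": 0, "tokens_per_sec": 0.0}
    decode_sec = max(end - first_token_at, 1e-3)  # avoid division by zero
    return {
        "ttft_ms": (first_token_at - start) * 1000,
        "total_ms": (end - start) * 1000,
        "output_tokens": count,
        # The first token is attributed to prefill, so decode covers count - 1 tokens.
        "tokens_per_sec": (count - 1) / decode_sec,
    }
```

Feeding `ttft_ms` and `tokens_per_sec` into two separate histograms is what makes a slow prefill distinguishable from a slow decode.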

Continuous batching (deep dive)

Why one request per GPU forward pass wastes money

Classic serving: one request finishes, then the next one starts. The GPU spends much of its time waiting. Continuous batching (vLLM style) processes multiple sequences in the same forward pass for as long as each of them still has a next token to generate.

How it works (simple)

Requests arrive in the queue (with different lengths).
The scheduler batches the active sequences.
One step: every sequence generates one token.
A sequence whose response is finished leaves the batch.
A new request enters the batch.

Throughput goes up. Cost per token goes down.
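The steps above can be sketched as a toy scheduler that only counts decode steps. It is a simplification under stated assumptions: prefill is ignored, every sequence needs at least one token, and the function names are hypothetical. It is not how vLLM implements its scheduler.

```python
from collections import deque


def static_batching_steps(lengths: list[int], max_batch: int) -> int:
    """Decode steps when each batch must fully finish before the next starts."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])  # batch waits for its longest member
    return steps


def continuous_batching_steps(lengths: list[int], max_batch: int) -> int:
    """Decode steps when a finished sequence frees its slot immediately."""
    queue = deque(lengths)
    active: list[int] = []  # remaining tokens for each active sequence
    steps = 0
    while queue or active:
        # Admit waiting requests into free slots.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One forward pass: every active sequence emits one token.
        active = [remaining - 1 for remaining in active]
        steps += 1
        # Finished sequences leave the batch.
        active = [remaining for remaining in active if remaining > 0]
    return steps


output_lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # tokens each request needs
print(static_batching_steps(output_lengths, max_batch=4))      # 16
print(continuous_batching_steps(output_lengths, max_batch=4))  # 10
```

With mixed output lengths the static scheduler idles three slots while the long sequence finishes; the continuous one refills them, so the same work takes 10 forward passes instead of 16.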

Trade-off

Pro	Con
Higher GPU utilization	Tail latency less predictable
Better cost at scale	Harder to debug per-request
Fits many concurrent users	Needs good serving framework

In a research harness with a single user the benefit is small. In a production agentic system with many parallel agent steps the benefit is large.

Config sketch (vLLM)

# start vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --max-num-seqs 32 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

Point the gateway at it (base_url → local vLLM) and the agents keep using the same API. The agentic architecture does not change; only the inference backend is swapped.

KV cache (deep dive)

The hidden memory cost of long contexts

During transformer decode, generating each new token needs the attention state (keys and values) of every previous token. That stored state is the KV cache. The longer the context, the bigger the cache. With many concurrent users you hit OOM.
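The cache size follows directly from the model shape: one key and one value vector per layer, per KV head, per token. The sketch below assumes a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128) stored in FP16; models with grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    num_tokens: int,
    bytes_per_value: int = 2,  # FP16 / BF16
) -> int:
    """Bytes of KV cache for one sequence of num_tokens tokens."""
    # The leading 2 is for one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


per_token = kv_cache_bytes(32, 32, 128, num_tokens=1)
per_request = kv_cache_bytes(32, 32, 128, num_tokens=4096)
print(per_token / 2**20, "MiB per token")      # 0.5
print(per_request / 2**30, "GiB per request")  # 2.0
```

At 2 GiB per 4096-token request, ten concurrent long requests on this assumed shape need more memory for the cache than for the 14 GB of FP16 weights.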

PagedAttention (idea)

Like OS virtual memory: keep the KV cache in non-contiguous pages and allocate them on demand. Result: more concurrent requests on the same GPU.
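The paging idea can be illustrated with a toy block allocator: each sequence owns a block table, and a new block is taken from the free pool only when the previous one is full. This is a conceptual sketch with hypothetical names (`PagedKVAllocator`, `append_token`, `release`) and an assumed block size of 16 tokens; it tracks block IDs only and stores no tensors.

```python
class PagedKVAllocator:
    """Toy allocator: maps each sequence to a list of fixed-size KV blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free: list[int] = list(range(num_blocks))  # free physical block IDs
        self.tables: dict[str, list[int]] = {}          # seq_id -> block table
        self.lengths: dict[str, int] = {}               # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("req-a")  # 17 tokens -> 2 blocks
alloc.append_token("req-b")      # 1 token  -> 1 block
```

Because memory is reserved one block at a time, a short request never holds a full max-context allocation, which is where the extra concurrency comes from.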

Prefix caching

In an agentic system the system prompt + skill stay nearly the same from call to call; only the user task changes. With a prefix cache, the KV for the shared prefix is reused, and TTFT drops a lot.

Pattern	Prefix cache benefit
Same research agent, new query	High (big system prompt)
Totally new agent each call	Low
RAG with huge retrieved docs	Low (prefix changes)
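The table can be reproduced with a toy prefix cache that keys each block of prompt tokens by a hash of everything up to that block. This is a sketch of the idea only: `PrefixCache` is hypothetical, it stores a placeholder instead of real KV tensors, and it never evicts.

```python
import hashlib


class PrefixCache:
    """Toy prefix cache keyed by chained hashes of token blocks."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.blocks: dict[str, bool] = {}  # chained hash -> placeholder for cached KV

    def _block_keys(self, tokens: list[int]) -> list[str]:
        keys = []
        h = hashlib.sha256()
        full = len(tokens) - len(tokens) % self.block_size  # only full blocks are cached
        for i in range(0, full, self.block_size):
            # Chaining: each key depends on this block and on every block before it.
            h.update(repr(tokens[i:i + self.block_size]).encode())
            keys.append(h.copy().hexdigest())
        return keys

    def lookup_and_insert(self, tokens: list[int]) -> int:
        """Return how many prompt tokens could reuse cached KV, then cache the rest."""
        hits = 0
        for key in self._block_keys(tokens):
            if key in self.blocks:
                hits += 1
            else:
                self.blocks[key] = True
        return hits * self.block_size


system_prompt = list(range(8))               # 8 shared tokens (2 blocks)
query_1 = system_prompt + [100, 101, 102, 103]
query_2 = system_prompt + [200, 201, 202, 203]

cache = PrefixCache(block_size=4)
first = cache.lookup_and_insert(query_1)   # cold cache
second = cache.lookup_and_insert(query_2)  # shared system prompt is reused
```

The second query reuses the 8 system-prompt tokens and only prefills its own 4. If the changing content came first (as with retrieved documents placed ahead of the instructions), no block would match.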

Context budget (agentic + inference)

Whatever the memory layer retrieves goes into the prompt. To the inference layer, that is the input token bill.

Rule: retrieve less, but more relevant. Filling a 50k-token context is not an inference win; it is a loss.
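One way to enforce that rule is a hard token budget on retrieval. The sketch below assumes each retrieved chunk already carries a relevance score and a token count; `select_within_budget` and the threshold values are hypothetical.

```python
def select_within_budget(
    chunks: list[tuple[float, int, str]],  # (relevance score, token count, text)
    budget_tokens: int,
    min_score: float = 0.0,
) -> tuple[list[str], int]:
    """Pick the most relevant chunks that fit in the token budget."""
    chosen: list[str] = []
    used = 0
    for score, n_tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if score < min_score:
            break  # everything after this is even less relevant
        if used + n_tokens <= budget_tokens:
            chosen.append(text)
            used += n_tokens
    return chosen, used


retrieved = [
    (0.9, 300, "chunk-a"),
    (0.4, 500, "chunk-b"),
    (0.8, 600, "chunk-c"),
    (0.2, 100, "chunk-d"),
]
selected, tokens_used = select_within_budget(retrieved, budget_tokens=1000, min_score=0.3)
print(selected, tokens_used)
```

The budget caps prefill time and KV cache size per call, and the score threshold keeps low-value chunks out even when there is room for them.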

Quantization (deep dive)

Smaller weights, faster math, less VRAM

An unquantized FP16 7B model is ~14GB+ of weights. INT8/FP8 is close to half of that. On Apple Silicon, GGUF quantized models (llama.cpp) are practical for research.
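The weight sizes above are just parameter count times bytes per parameter. A minimal sketch, assuming nominal bit widths; real quantized files are somewhat larger because they also store scales and other metadata.

```python
# Nominal bytes per parameter for each format.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_gb(num_params: float, fmt: str) -> float:
    """Approximate weight memory in GB (10^9 bytes), ignoring quantization metadata."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


for fmt in ("fp16", "int8", "int4"):
    print(fmt, weight_gb(7e9, fmt), "GB")
```

This only covers weights. The KV cache and activations come on top, so the usable budget on a given GPU is smaller than the raw VRAM figure suggests.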

Levels (practical order)

Format	Typical use	Quality
FP16 / BF16	Baseline GPU serve	Best
FP8	H100 class datacenter	Very good
INT8 / AWQ / GPTQ	Consumer GPU, vLLM	Good for many tasks
4-bit GGUF	Local laptop (Metal)	OK for drafts, classify
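Where the quality column comes from is easier to see on a toy example. The sketch below shows symmetric absmax INT8 quantization of a handful of weights in plain Python. It is the simplest scheme, not what AWQ, GPTQ, or GGUF actually do (those use per-group scales and calibration), and the function names are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all zeros
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0, 0.123]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight lands within half a quantization step of its original value. Whether that rounding error matters depends on the task, which is why the evaluation step below the table comes before rollout.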

When to quantize

Yes: classify, routing, short summaries, high volume
Maybe: long reasoning, code generation (evaluate first)
No (at first): one-off critical evals that produce paper numbers

Local research example

# inference/local_llm.py (llama.cpp via llama-cpp-python sketch)
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # Metal on M-series
)

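query = "Efficient attention kernels for long-context transformers"  # hypothetical example input

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You classify ML paper topics."},
        {"role": "user", "content": query},
    ],
    max_tokens=256,
    temperature=0.0,  # deterministic output for classification
)
# The result follows the OpenAI chat completion response shape.
print(out["choices"][0]["message"]["content"])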

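To keep the same agent code, serve the model behind an OpenAI-compatible endpoint (for example llama.cpp's server) and point the gateway at http://127.0.0.1:8080/v1 instead of OpenAI. The architecture is unchanged.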

Serving stack (deep dive)

What to run in production vs on a laptop

Option matrix

Stack	Best for	Notes
vLLM	Production GPU, agents at scale	PagedAttention, OpenAI API
TGI (Hugging Face)	HF ecosystem, enterprise	Good ops story
TensorRT-LLM	NVIDIA max performance	More setup
llama.cpp	Mac / edge / offline	GGUF, Metal, CPU
Managed API	Fastest to ship	OpenAI, Anthropic: they optimize for you

Minimal production path

Dev: managed API (fast iteration).
Staging: vLLM on one GPU, gateway routes strong model locally.
Prod: autoscale GPU pool, observability on TTFT + tps.
Research laptop: quantized local model for cheap loops.

OpenAI-compatible gateway hook

In the LiteLLM config from the agentic post, keep logical model names:

model_list:
  - model_name: fast
    litellm_params:
      model: openai/gpt-4o-mini
  - model_name: local-fast
    litellm_params:
      model: openai/meta-llama/Llama-3.2-3B-Instruct
      api_base: http://127.0.0.1:8000/v1
      api_key: local

When non-critical agent steps use local-fast, cost drops.
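On the agent side, the choice between the two logical names can be a small routing function. This is a sketch under assumptions: the step names in `CRITICAL_STEPS` and the `pick_model` helper are hypothetical, and only the model names `fast` and `local-fast` come from the config above.

```python
# Hypothetical step names that should stay on the hosted model.
CRITICAL_STEPS = {"final_answer", "code_generation"}


def pick_model(step: str, local_available: bool = True) -> str:
    """Return the logical model name the gateway should route this step to."""
    if step in CRITICAL_STEPS:
        return "fast"
    # Non-critical steps go local, with a fallback if the local server is down.
    return "local-fast" if local_available else "fast"


print(pick_model("classify"))       # local-fast
print(pick_model("final_answer"))   # fast
```

Because agents only see logical names, moving a step between backends is a one-line change here and needs no change in the agent code.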

Speculative decoding (overview)

Draft model guesses; target model verifies

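A small draft model quickly proposes a few tokens. The large target model verifies them in parallel. When they are accepted, several tokens land in a single step. Latency drops, and output quality stays that of the target model.

Trade-off: two models to load, plus tuning overhead. Worth it in latency-sensitive production serving, though the gain tends to shrink once large batches already saturate the GPU. In a toy research harness, often skip it.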
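The propose-then-verify loop can be shown with greedy decoding and two toy "models" that are plain functions from a token prefix to the next token. This is a conceptual sketch with hypothetical names: a real engine verifies all proposed positions in one batched target forward pass and, when sampling, uses a probabilistic acceptance rule rather than exact match.

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[int]], int]  # greedy next-token function


def speculative_step(prefix: list[int], draft: NextToken, target: NextToken, k: int = 4) -> list[int]:
    """Return the tokens accepted in one speculative step (always at least one)."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: list[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)
    # 2. The target checks each position (looped here for clarity).
    accepted: list[int] = []
    ctx = list(prefix)
    for token in proposed:
        expected = target(ctx)
        if expected != token:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    # 3. Every proposal matched: the target also supplies one extra token.
    accepted.append(target(ctx))
    return accepted


def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10  # counts upward modulo 10


def toy_draft(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # wrong whenever the last token is 3


print(speculative_step([0], toy_draft, toy_target, k=4))   # [1, 2, 3, 4]
print(speculative_step([0], toy_target, toy_target, k=4))  # [1, 2, 3, 4, 5]
```

In both runs the output is exactly what the target alone would have produced; the draft only changes how many tokens are confirmed per target step.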

Connection to the agentic system

The two posts are two layers of the same production story.

[Agentic post]                      [This post]
Gateway (which model)          →    Same API, local or cloud engine
Orchestrator (how many calls)  →    Fewer calls = less inference load
Memory (context size)          →    Shorter prefill, smaller KV cache
Sub-agents (Haiku vs Sonnet)   →    Model size = inference cost
Observability (traces)         →    + TTFT, tokens/s, GPU metrics
Tool registry / MCP            →    (mostly separate)

Optimization order (what I would do)

Workflow: fewer LLM calls (orchestrator, parallel tools).
Routing: cheap model where possible (gateway).
Context: memory retrieval threshold, token budget.
Serving: vLLM + batching + prefix cache.
Weights: quantization for volume paths.
Advanced: speculative decoding if still latency-bound.
Evaluate: agent evaluation before/after each change (pass rate, cost per successful task, not just faster failure).
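"Cost per successful task" is worth computing explicitly, because a cheaper call can still be a more expensive success. The numbers below are invented for illustration, and `cost_per_success` is a hypothetical helper.

```python
def cost_per_success(total_cost_usd: float, tasks_passed: int) -> float:
    """Total spend divided by the number of tasks that actually passed."""
    if tasks_passed == 0:
        return float("inf")  # nothing succeeded, so no spend was useful
    return total_cost_usd / tasks_passed


# Made-up eval results over the same 100 tasks.
before = cost_per_success(total_cost_usd=10.0, tasks_passed=90)  # baseline
after = cost_per_success(total_cost_usd=7.2, tasks_passed=60)    # after an optimization

print(f"before: ${before:.3f}  after: ${after:.3f}")
```

In this invented case total spend fell by 28%, but each successful task got more expensive, so the change is a regression by this metric.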

In a system design interview you can say: “We attacked cost at the agent graph first, then at the inference engine with batching and KV cache, then with quantization for the classification path.”

ML research harness

In paper replication, inference optimization means:

Smoke tests: tiny quantized local model
Long runs: cloud GPU + vLLM or managed API with budget cap
Log: model hash, quant format, tokens/s per step (reproducibility)
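A per-step log record along those lines could look like the sketch below. The field names, the `RunLog` dataclass, and the JSONL format are assumptions for illustration, not a format from the agentic post.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Iterable


def sha256_of_chunks(chunks: Iterable[bytes]) -> str:
    """Hash a byte stream incrementally (model files are too large to read at once)."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a model file, read in 1 MiB chunks."""
    with open(path, "rb") as f:
        return sha256_of_chunks(iter(lambda: f.read(chunk_size), b""))


@dataclass
class RunLog:
    run_id: str
    model_hash: str      # identifies the exact weights file
    quant_format: str    # e.g. "Q4_K_M" or "fp16"
    step: int
    tokens_per_sec: float


def to_jsonl_line(record: RunLog) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


record = RunLog("run-001", sha256_of_chunks([b"a", b"bc"]), "Q4_K_M", step=3, tokens_per_sec=41.5)
print(to_jsonl_line(record))
```

Hashing the weights file rather than logging its name protects against a file being silently replaced by a different quantization under the same path.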

This is the same idea as observability in the agentic post, applied to science: which run produced this output?

Part 2 summary

Section	Focus
Part 1	Stack, bottlenecks, link to agentic architecture
Metrics	TTFT, tokens/s, cost, what to log
Batching	vLLM, throughput vs tail latency
KV cache	PagedAttention, prefix cache, context budget
Quantization	FP8/INT8/GGUF, when safe
Serving	vLLM, TGI, llama.cpp, LiteLLM routing
Agentic link	Two-layer optimization story

Inference optimization does not replace agentic design. It completes the picture: coordination on top, fast math below. Next: agent evaluation to prove the stack still works after each optimization.

If you have any questions, leave a comment. #Inference #LLMServing #vLLM #AIEngineering