
Real-time AI — Streaming, SSE and Instant Responses

Ádám Zsolt & AIMY
15 min read

Executive summary

Users don't wait. According to Google's research, 53% abandon a website if it takes more than 3 seconds to load. AI response times — especially for LLM-based systems — can naturally range from seconds to minutes. Answering a complex question with tool calling, RAG context and multiple iterations may take 5–15 seconds.

The question is not how to speed up the AI (LLM response time is a physical limit). The question is: how do we make the wait invisible?

This whitepaper presents the full architecture of real-time AI communication:

  1. Streaming protocols — SSE, WebSocket, HTTP/2 and HTTP/3
  2. Token streaming — character-by-character LLM output rendering
  3. Multi-phase streaming — combining tool calling, thinking, and answer
  4. Latency optimization — TTFB, cold start, connection pooling
  5. Production challenges — load balancing, reconnect, error handling
  6. UX patterns — skeleton, typing indicator, progressive rendering
  7. Industry direction — 2025–2026 trends and standards

1. Why is AI streaming different from traditional streaming?

1.1 The classic web: request-response

Traditional web communication is simple: the client sends a request, the server replies. The response arrives in full, then the client renders it.

Client ── GET /api/data ──▶ Server
                              │ (processing: 50ms)
Client ◀── 200 OK + JSON ── Server

This works when the response is ready in milliseconds. It does not work when the server thinks for 8 seconds.

1.2 The nature of LLM responses

LLMs (GPT-4o, Claude, Gemini) generate responses token by token. A 500-token answer is not produced at once — it appears one token at a time, left-to-right, at ~20–80ms/token.

Time:   0ms    50ms    100ms   150ms   200ms   ...   8000ms
Token:  "T"    "h"     "e"     " "     "c"     ...   "[END]"

If you wait for the entire response before sending it to the client, the user stares at an empty screen for 8 seconds. If you stream token by token, the first character appears in 50ms — and the user perceives the AI as answering "instantly."

1.3 The psychology of perception

Response time | User perception
< 100ms       | Instant
100ms – 1s    | Fast, but noticeable delay
1s – 3s       | "Thinking" — acceptable with feedback
3s – 10s      | Slow — but tolerable with streaming
> 10s         | Unacceptable — the user is gone

Streaming does not reduce actual response time. It reduces Time to First Byte (TTFB) — the time until the user sees the first character. The difference between 50ms and 8s is the difference between an "instant" and an "unacceptable" experience.


2. Streaming protocols: SSE, WebSocket, and the rest

2.1 Server-Sent Events (SSE)

SSE is the simplest streaming protocol: the server pushes a one-way data stream to the client over HTTP.

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
Transfer-Encoding: chunked

event: message
data: {"type":"token","content":"T"}

event: message
data: {"type":"token","content":"h"}

event: done
data: {"type":"done","usage":{"input":150,"output":87}}

Advantages:

  • HTTP-based — passes through proxies, CDNs, load balancers
  • Automatic reconnection (the browser's EventSource API handles it)
  • Simple implementation (server: res.write(), client: EventSource)
  • Firewall-friendly — most enterprise firewalls allow HTTP

Disadvantages:

  • One-way (server → client). The client cannot push data on the stream
  • Text-based — not ideal for binary data (audio, image)
  • Browsers limit to ~6 parallel SSE connections per domain (HTTP/1.1)
  • No built-in heartbeat — proxies may time out a quiet connection
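A minimal sketch of what the server side looks like in plain Node.js (no framework assumed). The `formatSseEvent` helper and the demo payloads are illustrative; only the response headers and the `event:`/`data:` wire format are dictated by the SSE spec:

```typescript
import * as http from "http";

// Serialize one SSE event: "event: <name>\ndata: <json>\n\n".
// The blank line terminates the event on the wire.
export function formatSseEvent(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

const server = http.createServer((_req, res) => {
  // The headers from the example above: stream, don't cache, keep alive.
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });
  // Demo: stream a few tokens, then signal completion.
  for (const ch of ["T", "h", "e"]) {
    res.write(formatSseEvent("message", { type: "token", content: ch }));
  }
  res.write(formatSseEvent("done", { type: "done" }));
  res.end();
});
// server.listen(3000);
```

On the client, `new EventSource(url)` plus an `addEventListener("message", ...)` handler is all that is needed — reconnection comes for free.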

2.2 WebSocket

WebSocket provides bidirectional, full-duplex communication. Pros: binary support, two-way (client can send a "stop generating" message). Cons: more complex infrastructure, problematic proxy/CDN compatibility, manual reconnect.

2.3 HTTP/2 multiplexing and HTTP/3

HTTP/2 supports multiplexed streams over a single TCP connection — solving SSE's 6-connection limit. (HTTP/2 Server Push, by contrast, was deprecated and removed from major browsers; multiplexing is the feature that matters here.) HTTP/3 (QUIC) goes further: 0-RTT connection setup, no transport-level Head-of-Line blocking, better mobile performance.

2.4 Which one when?

Aspect                | SSE                   | WebSocket           | HTTP/2 Streaming
AI chat streaming     | Ideal                 | Works, but overkill | Good
Client → server       | Separate HTTP request | Built-in            | Separate stream
Infrastructure        | Low complexity        | High                | Medium
CDN/proxy compat.     | Excellent             | Problematic         | Good
Auto-reconnect        | Built-in              | Manual              | Manual
Binary (audio, image) | No                    | Yes                 | Yes

The industry consensus in 2026: for AI chat streaming, SSE is the default. OpenAI, Anthropic, Google and Mistral APIs all use SSE. WebSocket comes in when you need voice AI or binary data streaming.


3. Token streaming: real-time rendering of LLM output

3.1 The basic mechanism

LLM providers stream responses token-by-token with stream: true.

OpenAI format:

data: {"choices":[{"delta":{"content":"T"},"finish_reason":null}]}
data: {"choices":[{"delta":{"content":"h"},"finish_reason":null}]}
data: [DONE]

Anthropic format:

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"T"}}
event: message_stop
data: {"type":"message_stop"}
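A sketch of parsing the OpenAI-format lines above into plain text deltas. The function shape is an assumption (any SSE line splitter can feed it); the `data: ` prefix, the `[DONE]` sentinel, and the `choices[0].delta.content` path are from the format shown:

```typescript
// Extract the text delta from one OpenAI-style SSE line, or null if the
// line carries no text (non-data line, [DONE] sentinel, or partial JSON).
export function extractDelta(line: string): string | null {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null; // end-of-stream sentinel
  try {
    const parsed = JSON.parse(payload);
    return parsed.choices?.[0]?.delta?.content ?? null;
  } catch {
    return null; // chunk boundary split the JSON — wait for more bytes
  }
}
```

The Anthropic format differs only in the JSON path (`delta.text` on `content_block_delta` events), which is exactly why a provider-adapter layer pays off.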

3.2 The proxy layer: provider → app → client

In production, the LLM stream does not go directly to the client. An intermediate layer:

  1. Receives the LLM provider's stream
  2. Transforms it to the application's own format
  3. Forwards to the client
  4. Logs (token count, latency, error)
  5. Persists the full response to the database
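The five steps above can be sketched as a small proxy class. The `AppEvent` shape and class name are illustrative, not a fixed API; the point is that the proxy both forwards deltas immediately and accumulates the full answer for logging and persistence:

```typescript
// Hypothetical proxy core: forward each provider delta to the client in
// the app's own event format, while accumulating the full response.
type AppEvent =
  | { type: "token"; content: string }
  | { type: "done"; fullText: string; tokenCount: number };

export class StreamProxy {
  private buffer = "";
  private count = 0;
  constructor(private emit: (e: AppEvent) => void) {}

  // Step 1–3: receive a provider delta, transform, forward to the client.
  onDelta(text: string): void {
    this.buffer += text;
    this.count += 1;
    this.emit({ type: "token", content: text });
  }

  // Step 4–5: close the stream; the caller logs the counts and persists
  // the full text to the database.
  onEnd(): { fullText: string; tokenCount: number } {
    const summary = { fullText: this.buffer, tokenCount: this.count };
    this.emit({ type: "done", ...summary });
    return summary;
  }
}
```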

3.3 The Markdown streaming problem

LLMs often respond in Markdown. Markdown cannot be safely rendered character by character — a half-arrived code block or table breaks the renderer.

Strategies:

  1. Delayed rendering — 100–200ms buffer
  2. Incremental Markdown parser — handles partial Markdown
  3. Dual rendering — plain text during stream, Markdown re-render at end (ChatGPT)
  4. Token batching — emit per sentence/paragraph
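Strategy 4 can be sketched as a small buffer that flushes on sentence or paragraph boundaries. The class name and boundary heuristic are illustrative (a production version would also flush inside long unbroken spans):

```typescript
// Token batching sketch: hold tokens until a sentence/paragraph boundary,
// then flush a chunk that is safe to render as a unit.
export class SentenceBatcher {
  private buf = "";
  constructor(private flush: (chunk: string) => void) {}

  push(token: string): void {
    this.buf += token;
    // Flush after sentence-ending punctuation + whitespace, or a blank line.
    if (/[.!?]\s$/.test(this.buf) || this.buf.endsWith("\n\n")) {
      this.flush(this.buf);
      this.buf = "";
    }
  }

  // Flush whatever remains when the stream ends.
  end(): void {
    if (this.buf) this.flush(this.buf);
    this.buf = "";
  }
}
```

The same boundary detection reappears in voice AI (section 8.3), where complete sentences are handed to TTS.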

4. Multi-phase streaming: tool calling + thinking + answer

4.1 The real challenge: it's not just text

An AI agent's response is not a single text stream. The typical flow:

Phase 1: Thinking (tool selection)            ← 200–500ms
Phase 2: Tool call (CRM query, RAG)           ← 100–2000ms
Phase 3: Tool result processing               ← invisible
Phase 4: Second thinking pass                 ← 200ms
Phase 5: Possibly another tool call           ← 100–2000ms
Phase 6: Final answer generation (stream)     ← 2000–8000ms

4.2 Two-phase streaming architecture

Phase 1: Tool calling (status events)

Client ← {"type": "status", "message": "Looking up information..."}
Client ← {"type": "status", "message": "Querying CRM data..."}
Client ← {"type": "status", "message": "3 hits in the knowledge base..."}

Phase 2: Answer streaming (token events)

Client ← {"type": "token", "content": "Anna"}
Client ← {"type": "token", "content": " Smith"}
Client ← {"type": "done", "usage": {...}}
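On the client, the two phases reduce to a dispatch on the event's `type` field. A minimal sketch — the `ui` callback shape and the `/api/chat/stream` endpoint are assumptions, and the `EventSource` wiring runs in the browser:

```typescript
// Dispatch one raw SSE payload from the two-phase stream to the UI:
// status events drive the "thinking" indicator, token events append text.
export function handleEvent(
  raw: string,
  ui: { setStatus: (s: string) => void; appendToken: (t: string) => void },
): void {
  const msg = JSON.parse(raw);
  switch (msg.type) {
    case "status": ui.setStatus(msg.message); break;  // phase 1: tool calling
    case "token":  ui.appendToken(msg.content); break; // phase 2: answer
    case "done":   ui.setStatus(""); break;            // stream finished
  }
}

// Browser wiring (sketch):
// const es = new EventSource("/api/chat/stream");
// es.onmessage = (e) => handleEvent(e.data, ui);
```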

4.3 The "thinking" stream

Some reasoning models (e.g. DeepSeek-R1, Claude with extended thinking) also stream a chain of thought; others (OpenAI o1) keep it hidden. UX decision:

  • Show it (Claude, DeepSeek): "Transparent AI" for tech-savvy users
  • Hide it (ChatGPT): cleaner UX for general audiences
  • Summarize (compromise): short status messages

5. Latency optimization: every millisecond counts

5.1 Anatomy of latency

Total TTFB (Time to First Byte):

  • Best case: ~300ms (fast RAG, no queue, fast LLM)
  • Typical: ~800ms–2s
  • Worst case: ~5–15s (slow RAG, LLM queue, multiple tool iterations)

5.2 The 8 optimization points

  1. Connection pooling to the LLM provider — ~100ms saved per call
  2. RAG context prefetch — run lookups in parallel, so total wait = max(200ms, 50ms, 30ms), not the sum
  3. Streaming-first LLM call — always stream: true, lower TTFT
  4. Prompt optimization — shorter prompt = faster prefill
  5. Edge computing — server closer to the user, EU region
  6. Cold start elimination — dedicated server or warm container
  7. Response caching — FAQ, system prompt cache (Anthropic ~75% savings)
  8. LLM provider failover — adapter pattern, OpenAI → Claude → Gemini
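Point 8, the failover adapter, can be sketched as a loop over providers in priority order. The `Provider` shape and provider names are illustrative; real adapters would also distinguish retryable errors (429, timeout) from fatal ones:

```typescript
// Failover sketch: try each provider in order, return the first success.
type Provider = {
  name: string;
  call: (prompt: string) => Promise<string>;
};

export async function withFailover(
  providers: Provider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  let lastError: unknown;
  for (const p of providers) {
    try {
      return { provider: p.name, text: await p.call(prompt) };
    } catch (err) {
      lastError = err; // log and fall through to the next provider
    }
  }
  throw lastError; // every provider failed
}
```

Usage: `withFailover([openai, claude, gemini], prompt)` — the call order encodes the OpenAI → Claude → Gemini preference from the list above.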

5.3 The latency budget

Phase           | Budget   | Optimization
Client → server | < 50ms   | EU server, CDN
Auth + parsing  | < 10ms   | Cached JWT, minimal middleware
RAG + history   | < 200ms  | Parallel, pgvector index, limit
LLM TTFT        | < 1000ms | Streaming, prompt optimization
Server → client | < 50ms   | SSE, keep-alive, no buffering
TTFB (total)    | < 1500ms | Acceptable UX

6. Production challenges

6.1 Load balancing with SSE

An SSE stream is a long-running HTTP connection — and load balancers enforce idle timeouts. If the SSE stream sends no data for 30 seconds (because the LLM is thinking), the load balancer drops the connection.

Solutions:

  1. Heartbeat/keepalive SSE event — periodic empty comment (:) line in the stream
  2. Increase proxy timeouts — Nginx proxy_read_timeout, HAProxy timeout server
  3. Sticky sessions — session affinity across multiple backend servers
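Solution 1 is a few lines of code. A sketch — the 15-second interval is an illustrative choice (it just has to be shorter than the proxy's idle timeout); lines starting with `:` are SSE comments that EventSource silently ignores:

```typescript
import type { ServerResponse } from "http";

// An SSE comment frame: ignored by EventSource, but it is traffic,
// so idle-timeout proxies keep the connection open.
export const heartbeatFrame = ": keepalive\n\n";

// Start a periodic heartbeat on an open SSE response.
// Returns a stop function to call when the stream ends.
export function startHeartbeat(
  res: Pick<ServerResponse, "write">,
  intervalMs = 15_000,
): () => void {
  const timer = setInterval(() => res.write(heartbeatFrame), intervalMs);
  return () => clearInterval(timer);
}
```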

6.2 Reconnect handling

Connections break in the real world. The SSE EventSource API gives automatic reconnection with Last-Event-ID support. The challenge: the LLM provider stream is not replayable — tokens that already arrived must be buffered (in memory or Redis) and resent on client reconnect.
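A sketch of that buffer: each emitted token gets a sequential SSE `id`, and on reconnect the client's Last-Event-ID selects what to resend. In-memory here for illustration; in production the class name and storage would be a Redis-backed structure with a TTL:

```typescript
// Replay buffer for one stream: tokens get sequential ids, so a
// reconnecting client can resume from its Last-Event-ID header.
export class ReplayBuffer {
  private events: { id: number; data: string }[] = [];
  private nextId = 1;

  // Record a token and return it with its id (written as "id: <n>" on the wire).
  push(data: string): { id: number; data: string } {
    const ev = { id: this.nextId++, data };
    this.events.push(ev);
    return ev;
  }

  // Everything the client missed after its last acknowledged event.
  replayAfter(lastEventId: number): { id: number; data: string }[] {
    return this.events.filter((e) => e.id > lastEventId);
  }
}
```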

6.3 Concurrent streams and resource management

Parameter                     | Value
Concurrent streams            | 100
Avg. stream length            | 15s
Token buffer / stream         | ~5KB
Conversation context / stream | ~20KB
HTTP connection overhead      | ~10KB
Total per stream              | ~35KB
100 concurrent streams        | ~3.5MB

6.4 Mid-stream error handling

Best practice: emit an error event + persist the partial response + show a client-side "Regenerate" button.

6.5 Backpressure: when the client is slower than the server

Node.js pipe() — and the more robust stream.pipeline() — handles backpressure natively, but edge cases (a slow mobile client, a backgrounded tab) must still be tested explicitly.
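What pipe() does under the hood is worth seeing once. A manual sketch over minimal stream-like interfaces (the interfaces are illustrative; Node's Readable/ServerResponse satisfy them): when `write()` returns false the client's buffer is full, so the source is paused until the `drain` event:

```typescript
// Minimal stream-like interfaces so the sketch works with Node streams
// or any compatible source/sink.
interface Src {
  on(event: "data" | "end", cb: (chunk?: unknown) => void): void;
  pause(): void;
  resume(): void;
}
interface Sink {
  write(chunk: unknown): boolean; // false = internal buffer is full
  once(event: "drain", cb: () => void): void;
  end(): void;
}

// Forward src to res, pausing the source whenever the sink signals
// backpressure and resuming once it drains.
export function forwardWithBackpressure(src: Src, res: Sink): void {
  src.on("data", (chunk) => {
    if (!res.write(chunk)) {
      src.pause();
      res.once("drain", () => src.resume());
    }
  });
  src.on("end", () => res.end());
}
```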


7. UX patterns for real-time AI

7.1 Pattern overview

Pattern                | When to use it
Typing indicator       | Simple chat, "AI is typing..." animation
Skeleton loading       | Structured responses (cards, tables, lists)
Progressive disclosure | Complex answers — gist first, details later
Stop generating        | Always — user control
Suggested actions      | After the answer — quick replies tied to MCP tools

8. Voice AI: the next frontier

8.1 Voice streaming is different

Text streaming masks the wait — the user watches the text appear. Voice is different: latency is felt directly.

Text chat                     | Voice AI
1–2s TTFB acceptable          | > 500ms TTFB feels "robotic"
User reads                    | User listens → silence = bad
Response is scrollable        | Audio is gone, not replayable inline
Markdown, tables, code → rich | Linear text only

8.2 The voice AI pipeline

User ──[audio]──▶ STT (Speech-to-Text)        ~200–500ms
                       │
                       ▼
                  LLM (text generation)        ~500–2000ms
                       │
                       ▼
                  TTS (Text-to-Speech)         ~200–500ms
                       │
User ◀──[audio]───────┘

Total latency: 900–3000ms

8.3 Latency reduction in voice AI

  1. Streaming TTS — generate audio sentence by sentence
  2. Filler sounds — "Hmm...", "So..." (a Google Duplex trick since 2018)
  3. Sentence boundary detection — detect ., !, ? and send to TTS immediately
  4. Speculative execution — the system pre-guesses the user's question

8.4 Protocols for voice AI

WebSocket is the default voice-AI protocol in 2026. WebRTC kicks in when P2P audio with minimal latency is required.


9. Industry direction: 2025–2026

9.1 Provider streaming API evolution

  • OpenAI Realtime API (Oct 2024): WebSocket-based, text + audio, full-duplex
  • Anthropic prompt caching: ~75% input token savings → faster prefill
  • Gemini 2.0 Flash: ~20ms/token throughput, among the fastest on the market
  • Mistral EU-hosted: GDPR-compliant streaming endpoints

9.2 Edge AI — streaming without a remote LLM

The most radical latency reduction: don't call a remote LLM. Small models (Gemma 2B, Phi-3 Mini, Llama 3 8B) can run on mobile devices (Apple Neural Engine, Qualcomm NPU), edge servers (Cloudflare Workers AI), or on-prem GPUs.

Aspect     | Cloud LLM             | Edge LLM
TTFT       | 200–1000ms            | 20–100ms
Tokens/s   | 20–80                 | 30–150 (hardware-dependent)
Cost       | Per-token API fee     | Fixed hardware + energy
Model size | Unlimited             | Max ~13B (mobile: ~3B)
Quality    | GPT-4o level          | Lower, but improving
Privacy    | Data goes to the cloud | Data stays local

2026 trend: hybrid approach — simple questions answered by a local (edge) model (~50ms TTFT), complex questions by a cloud LLM (~800ms TTFT). The routing decision is made by a small classifier model.
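The routing decision can be sketched with a cheap heuristic standing in for the small classifier model (the function, thresholds, and keyword list are all illustrative assumptions):

```typescript
// Hybrid router sketch: a heuristic stand-in for the small classifier
// that decides whether a query goes to the edge model or the cloud LLM.
export function routeQuery(query: string): "edge" | "cloud" {
  const needsCloud =
    query.length > 200 || // long queries are likely complex
    /\b(why|compare|analy[sz]e|explain)\b/i.test(query); // reasoning verbs
  return needsCloud ? "cloud" : "edge";
}
```

In production the heuristic would be replaced by an actual classifier (itself a small, fast model), but the interface stays the same: query in, route out.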

9.3 Speculative decoding — internal LLM acceleration

A small draft model (~1B) quickly proposes 5–10 tokens → the large target model (~70B) verifies them in a single forward pass, regenerating any it rejects. Result: ~2–3x speedup with no quality loss. Major providers are reported to use it in production.

9.4 Structured output streaming

OpenAI introduced streaming with response_format: { type: "json_schema" } in 2024 — the model emits tokens conforming to the JSON schema, and the client can parse partial JSON. Useful in an AI agent's evaluator–executor loop: the executor can start acting on early actions while later ones are still being generated.


10. Practical decision framework

10.1 The streaming decision tree

Is the AI response interactive (user is waiting)?
  │
  ├─ Yes → Text or voice?
  │           │
  │           ├─ Text → SSE (default)
  │           │           HTTP/2 + SSE if available
  │           │           WebSocket if you need bidirectionality
  │           │
  │           └─ Voice → WebSocket (binary)
  │                      WebRTC for ultra-low latency
  │
  └─ No (background task) → No streaming needed
                             Queue-based (BullMQ) + webhook

10.2 Monitoring checklist

Metric            | What it measures                | Target
TTFB (P50)        | Median time to first byte       | < 1s
TTFB (P95)        | 95th percentile                 | < 3s
Token rate        | Tokens per second on the stream | > 15 t/s
Stream error rate | Ratio of aborted streams        | < 1%
Reconnect rate    | Reconnection ratio              | < 5%
Client render lag | Client-side rendering delay     | < 50ms/token

11. Summary — the 7 most important takeaways

  1. SSE is the default: For AI chat streaming, SSE is simple, reliable, and compatible.
  2. TTFB > total response time: Time-to-first-token matters more than total response time.
  3. Multi-phase streaming: Communicate the tool calling + RAG + answer phases to the client.
  4. Latency budget: Define, measure, optimize. Biggest wins: parallel RAG/history and prompt size.
  5. Production edge cases: Heartbeat, reconnect + token buffer, backpressure, error handling.
  6. Voice AI is a different category: < 500ms TTFB, WebSocket, sentence-level TTS, filler sounds.
  7. Edge AI + speculative decoding: A hybrid future — easy tasks locally, complex tasks in the cloud.

The single rule

The best streaming implementation is the one users don't notice — they only feel that the AI answers "instantly."


Designing real-time AI chat or a voice agent?

The Atlosz team helps you design and implement streaming architecture, SSE/WebSocket integration, latency optimization, and production edge cases (reconnect, backpressure, error handling) — tailored to your system.

Let's talk about your AI strategy →