
Real-time AI — Streaming, SSE and Instant Responses

Ádám Zsolt & AIMY
15 min read

Executive summary

Users don't wait. According to Google's research, 53% abandon a website if it takes more than 3 seconds to load. AI response times — especially for LLM-based systems — can naturally range from seconds to minutes. Answering a complex question with tool calling, RAG context and multiple iterations may take 5–15 seconds.

The question is not how to speed up the AI (LLM response time is a physical limit). The question is: how do we make the wait invisible?

This whitepaper presents the full architecture of real-time AI communication:

  1. Streaming protocols — SSE, WebSocket, HTTP/2 and HTTP/3
  2. Token streaming — character-by-character LLM output rendering
  3. Multi-phase streaming — combining tool calling, thinking, and answer
  4. Latency optimization — TTFB, cold start, connection pooling
  5. Production challenges — load balancing, reconnect, error handling
  6. UX patterns — skeleton, typing indicator, progressive rendering
  7. Industry direction — 2025–2026 trends and standards

1. Why is AI streaming different from traditional streaming?

1.1 The classic web: request-response

Traditional web communication is simple: the client sends a request, the server replies. The response arrives in full, then the client renders it.

Client ── GET /api/data ──▶ Server
                              │ (processing: 50ms)
Client ◀── 200 OK + JSON ── Server

This works when the response is ready in milliseconds. It does not work when the server thinks for 8 seconds.

1.2 The nature of LLM responses

LLMs (GPT-4o, Claude, Gemini) generate responses token by token. A 500-token answer is not produced at once — it appears one token at a time, left-to-right, at ~20–80ms/token.

Time:   0ms    50ms    100ms   150ms   200ms   ...   8000ms
Token:  "T"    "h"     "e"     " "     "c"     ...   "[END]"

If you wait for the entire response before sending it to the client, the user stares at an empty screen for 8 seconds. If you stream token by token, the first character appears in 50ms — and the user perceives the AI as answering "instantly."

1.3 The psychology of perception

Response time | User perception
< 100ms       | Instant
100ms – 1s    | Fast, but noticeable delay
1s – 3s       | "Thinking" — acceptable with feedback
3s – 10s      | Slow — but tolerable with streaming
> 10s         | Unacceptable — the user is gone

Streaming does not reduce actual response time. It reduces Time to First Byte (TTFB) — the time until the user sees the first character. The difference between 50ms and 8s is the difference between an "instant" and an "unacceptable" experience.


2. Streaming protocols: SSE, WebSocket, and the rest

2.1 Server-Sent Events (SSE)

SSE is the simplest streaming protocol: the server pushes a one-way data stream to the client over HTTP.

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
Transfer-Encoding: chunked

event: message
data: {"type":"token","content":"T"}

event: message
data: {"type":"token","content":"h"}

event: done
data: {"type":"done","usage":{"input":150,"output":87}}

Advantages:

  • HTTP-based — passes through proxies, CDNs, load balancers
  • Automatic reconnection (the browser's EventSource API handles it)
  • Simple implementation (server: res.write(), client: EventSource)
  • Firewall-friendly — most enterprise firewalls allow HTTP

Disadvantages:

  • One-way (server → client). The client cannot push data on the stream
  • Text-based — not ideal for binary data (audio, image)
  • Browsers limit to ~6 parallel SSE connections per domain (HTTP/1.1)
  • No built-in heartbeat — proxies may time out a quiet connection
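A minimal sketch of what the server side looks like in plain Node.js (no framework assumed). The `formatSseEvent` helper and the demo payloads are illustrative; only the response headers and the `event:`/`data:` wire format are dictated by the SSE spec:

```typescript
import * as http from "http";

// Serialize one SSE event: "event: <name>\ndata: <json>\n\n".
// The blank line terminates the event on the wire.
export function formatSseEvent(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

const server = http.createServer((_req, res) => {
  // The headers from the example above: stream, don't cache, keep alive.
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });
  // Demo: stream a few tokens, then signal completion.
  for (const ch of ["T", "h", "e"]) {
    res.write(formatSseEvent("message", { type: "token", content: ch }));
  }
  res.write(formatSseEvent("done", { type: "done" }));
  res.end();
});
// server.listen(3000);
```

On the client, `new EventSource(url)` plus an `addEventListener("message", ...)` handler is all that is needed — reconnection comes for free.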

2.2 WebSocket

WebSocket provides bidirectional, full-duplex communication. Pros: binary support, two-way (client can send a "stop generating" message). Cons: more complex infrastructure, problematic proxy/CDN compatibility, manual reconnect.

2.3 HTTP/2 multiplexing and HTTP/3

HTTP/2 supports multiplexed streams over a single TCP connection — solving SSE's 6-connection limit. (HTTP/2 Server Push, by contrast, was deprecated and removed from major browsers; multiplexing is the feature that matters here.) HTTP/3 (QUIC) goes further: 0-RTT connection setup, no transport-level Head-of-Line blocking, better mobile performance.

2.4 Which one when?

Aspect                | SSE                   | WebSocket           | HTTP/2 Streaming
AI chat streaming     | Ideal                 | Works, but overkill | Good
Client → server       | Separate HTTP request | Built-in            | Separate stream
Infrastructure        | Low complexity        | High                | Medium
CDN/proxy compat.     | Excellent             | Problematic         | Good
Auto-reconnect        | Built-in              | Manual              | Manual
Binary (audio, image) | No                    | Yes                 | Yes

The industry consensus in 2026: for AI chat streaming, SSE is the default. OpenAI, Anthropic, Google and Mistral APIs all use SSE. WebSocket comes in when you need voice AI or binary data streaming.


3. Token streaming: real-time rendering of LLM output

3.1 The basic mechanism

LLM providers stream responses token-by-token with stream: true.

OpenAI format:

data: {"choices":[{"delta":{"content":"T"},"finish_reason":null}]}
data: {"choices":[{"delta":{"content":"h"},"finish_reason":null}]}
data: [DONE]

Anthropic format:

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"T"}}
event: message_stop
data: {"type":"message_stop"}
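A sketch of parsing the OpenAI-format lines above into plain text deltas. The function shape is an assumption (any SSE line splitter can feed it); the `data: ` prefix, the `[DONE]` sentinel, and the `choices[0].delta.content` path are from the format shown:

```typescript
// Extract the text delta from one OpenAI-style SSE line, or null if the
// line carries no text (non-data line, [DONE] sentinel, or partial JSON).
export function extractDelta(line: string): string | null {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null; // end-of-stream sentinel
  try {
    const parsed = JSON.parse(payload);
    return parsed.choices?.[0]?.delta?.content ?? null;
  } catch {
    return null; // chunk boundary split the JSON — wait for more bytes
  }
}
```

The Anthropic format differs only in the JSON path (`delta.text` on `content_block_delta` events), which is exactly why a provider-adapter layer pays off.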

3.2 The proxy layer: provider → app → client

In production, the LLM stream does not go directly to the client. An intermediate layer:

  1. Receives the LLM provider's stream
  2. Transforms it to the application's own format
  3. Forwards to the client
  4. Logs (token count, latency, error)
  5. Persists the full response to the database
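The five steps above can be sketched as a small proxy class. The `AppEvent` shape and class name are illustrative, not a fixed API; the point is that the proxy both forwards deltas immediately and accumulates the full answer for logging and persistence:

```typescript
// Hypothetical proxy core: forward each provider delta to the client in
// the app's own event format, while accumulating the full response.
type AppEvent =
  | { type: "token"; content: string }
  | { type: "done"; fullText: string; tokenCount: number };

export class StreamProxy {
  private buffer = "";
  private count = 0;
  constructor(private emit: (e: AppEvent) => void) {}

  // Step 1–3: receive a provider delta, transform, forward to the client.
  onDelta(text: string): void {
    this.buffer += text;
    this.count += 1;
    this.emit({ type: "token", content: text });
  }

  // Step 4–5: close the stream; the caller logs the counts and persists
  // the full text to the database.
  onEnd(): { fullText: string; tokenCount: number } {
    const summary = { fullText: this.buffer, tokenCount: this.count };
    this.emit({ type: "done", ...summary });
    return summary;
  }
}
```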

3.3 The Markdown streaming problem

LLMs often respond in Markdown. Markdown cannot be safely rendered character by character — a half-arrived code block or table breaks the renderer.

Strategies:

  1. Delayed rendering — 100–200ms buffer
  2. Incremental Markdown parser — handles partial Markdown
  3. Dual rendering — plain text during stream, Markdown re-render at end (ChatGPT)
  4. Token batching — emit per sentence/paragraph
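Strategy 4 can be sketched as a small buffer that flushes on sentence or paragraph boundaries. The class name and boundary heuristic are illustrative (a production version would also flush inside long unbroken spans):

```typescript
// Token batching sketch: hold tokens until a sentence/paragraph boundary,
// then flush a chunk that is safe to render as a unit.
export class SentenceBatcher {
  private buf = "";
  constructor(private flush: (chunk: string) => void) {}

  push(token: string): void {
    this.buf += token;
    // Flush after sentence-ending punctuation + whitespace, or a blank line.
    if (/[.!?]\s$/.test(this.buf) || this.buf.endsWith("\n\n")) {
      this.flush(this.buf);
      this.buf = "";
    }
  }

  // Flush whatever remains when the stream ends.
  end(): void {
    if (this.buf) this.flush(this.buf);
    this.buf = "";
  }
}
```

The same boundary detection reappears in voice AI (section 8.3), where complete sentences are handed to TTS.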

4. Multi-phase streaming: tool calling + thinking + answer

4.1 The real challenge: it's not just text

An AI agent's response is not a single text stream. The typical flow:

Phase 1: Thinking (tool selection)            ← 200–500ms
Phase 2: Tool call (CRM query, RAG)           ← 100–2000ms
Phase 3: Tool result processing               ← invisible
Phase 4: Second thinking pass                 ← 200ms
Phase 5: Possibly another tool call           ← 100–2000ms
Phase 6: Final answer generation (stream)     ← 2000–8000ms

4.2 Two-phase streaming architecture

Phase 1: Tool calling (status events)

Client ← {"type": "status", "message": "Looking up information..."}
Client ← {"type": "status", "message": "Querying CRM data..."}
Client ← {"type": "status", "message": "3 hits in the knowledge base..."}

Phase 2: Answer streaming (token events)

Client ← {"type": "token", "content": "Anna"}
Client ← {"type": "token", "content": " Smith"}
Client ← {"type": "done", "usage": {...}}
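On the client, the two phases reduce to a dispatch on the event's `type` field. A minimal sketch — the `ui` callback shape and the `/api/chat/stream` endpoint are assumptions, and the `EventSource` wiring runs in the browser:

```typescript
// Dispatch one raw SSE payload from the two-phase stream to the UI:
// status events drive the "thinking" indicator, token events append text.
export function handleEvent(
  raw: string,
  ui: { setStatus: (s: string) => void; appendToken: (t: string) => void },
): void {
  const msg = JSON.parse(raw);
  switch (msg.type) {
    case "status": ui.setStatus(msg.message); break;  // phase 1: tool calling
    case "token":  ui.appendToken(msg.content); break; // phase 2: answer
    case "done":   ui.setStatus(""); break;            // stream finished
  }
}

// Browser wiring (sketch):
// const es = new EventSource("/api/chat/stream");
// es.onmessage = (e) => handleEvent(e.data, ui);
```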

4.3 The "thinking" stream

Some reasoning models (e.g. DeepSeek-R1, Claude with extended thinking) also stream a chain of thought; others (OpenAI o1) keep it hidden. UX decision:

  • Show it (Claude, DeepSeek): "Transparent AI" for tech-savvy users
  • Hide it (ChatGPT): cleaner UX for general audiences
  • Summarize (compromise): short status messages

5. Latency optimization: every millisecond counts

5.1 Anatomy of latency

Total TTFB (Time to First Byte):

  • Best case: ~300ms (fast RAG, no queue, fast LLM)
  • Typical: ~800ms–2s
  • Worst case: ~5–15s (slow RAG, LLM queue, multiple tool iterations)

5.2 The 8 optimization points

  1. Connection pooling to the LLM provider — ~100ms saved per call
  2. RAG context prefetch — run lookups in parallel, so total wait = max(200ms, 50ms, 30ms), not the sum
  3. Streaming-first LLM call — always stream: true, lower TTFT
  4. Prompt optimization — shorter prompt = faster prefill
  5. Edge computing — server closer to the user, EU region
  6. Cold start elimination — dedicated server or warm container
  7. Response caching — FAQ, system prompt cache (Anthropic ~75% savings)
  8. LLM provider failover — adapter pattern, OpenAI → Claude → Gemini
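Point 8, the failover adapter, can be sketched as a loop over providers in priority order. The `Provider` shape and provider names are illustrative; real adapters would also distinguish retryable errors (429, timeout) from fatal ones:

```typescript
// Failover sketch: try each provider in order, return the first success.
type Provider = {
  name: string;
  call: (prompt: string) => Promise<string>;
};

export async function withFailover(
  providers: Provider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  let lastError: unknown;
  for (const p of providers) {
    try {
      return { provider: p.name, text: await p.call(prompt) };
    } catch (err) {
      lastError = err; // log and fall through to the next provider
    }
  }
  throw lastError; // every provider failed
}
```

Usage: `withFailover([openai, claude, gemini], prompt)` — the call order encodes the OpenAI → Claude → Gemini preference from the list above.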

5.3 The latency budget

Phase           | Budget   | Optimization
Client → server | < 50ms   | EU server, CDN
Auth + parsing  | < 10ms   | Cached JWT, minimal middleware
RAG + history   | < 200ms  | Parallel, pgvector index, limit
LLM TTFT        | < 1000ms | Streaming, prompt optimization
Server → client | < 50ms   | SSE, keep-alive, no buffering
TTFB (total)    | < 1500ms | Acceptable UX

6. Production challenges

6.1 Load balancing with SSE

An SSE stream is a long-running HTTP connection — and load balancers enforce idle timeouts. If the SSE stream sends no data for 30 seconds (because the LLM is thinking), the load balancer drops the connection.

Solutions:

  1. Heartbeat/keepalive SSE event — periodic empty comment (:) line in the stream
  2. Increase proxy timeouts — Nginx proxy_read_timeout, HAProxy timeout server
  3. Sticky sessions — session affinity across multiple backend servers
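Solution 1 is a few lines of code. A sketch — the 15-second interval is an illustrative choice (it just has to be shorter than the proxy's idle timeout); lines starting with `:` are SSE comments that EventSource silently ignores:

```typescript
import type { ServerResponse } from "http";

// An SSE comment frame: ignored by EventSource, but it is traffic,
// so idle-timeout proxies keep the connection open.
export const heartbeatFrame = ": keepalive\n\n";

// Start a periodic heartbeat on an open SSE response.
// Returns a stop function to call when the stream ends.
export function startHeartbeat(
  res: Pick<ServerResponse, "write">,
  intervalMs = 15_000,
): () => void {
  const timer = setInterval(() => res.write(heartbeatFrame), intervalMs);
  return () => clearInterval(timer);
}
```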

6.2 Reconnect handling

Connections break in the real world. The SSE EventSource API gives automatic reconnection with Last-Event-ID support. The challenge: the LLM provider stream is not replayable — tokens that already arrived must be buffered (in memory or Redis) and resent on client reconnect.
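A sketch of that buffer: each emitted token gets a sequential SSE `id`, and on reconnect the client's Last-Event-ID selects what to resend. In-memory here for illustration; in production the class name and storage would be a Redis-backed structure with a TTL:

```typescript
// Replay buffer for one stream: tokens get sequential ids, so a
// reconnecting client can resume from its Last-Event-ID header.
export class ReplayBuffer {
  private events: { id: number; data: string }[] = [];
  private nextId = 1;

  // Record a token and return it with its id (written as "id: <n>" on the wire).
  push(data: string): { id: number; data: string } {
    const ev = { id: this.nextId++, data };
    this.events.push(ev);
    return ev;
  }

  // Everything the client missed after its last acknowledged event.
  replayAfter(lastEventId: number): { id: number; data: string }[] {
    return this.events.filter((e) => e.id > lastEventId);
  }
}
```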

6.3 Concurrent streams and resource management

Parameter                     | Value
Concurrent streams            | 100
Avg. stream length            | 15s
Token buffer / stream         | ~5KB
Conversation context / stream | ~20KB
HTTP connection overhead      | ~10KB
Total per stream              | ~35KB
100 concurrent streams        | ~3.5MB

6.4 Mid-stream error handling

Best practice: emit an error event + persist the partial response + show a client-side "Regenerate" button.

6.5 Backpressure: when the client is slower than the server

Node.js pipe() — and the more robust stream.pipeline() — handles backpressure natively, but edge cases (a slow mobile client, a backgrounded tab) must still be tested explicitly.
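What pipe() does under the hood is worth seeing once. A manual sketch over minimal stream-like interfaces (the interfaces are illustrative; Node's Readable/ServerResponse satisfy them): when `write()` returns false the client's buffer is full, so the source is paused until the `drain` event:

```typescript
// Minimal stream-like interfaces so the sketch works with Node streams
// or any compatible source/sink.
interface Src {
  on(event: "data" | "end", cb: (chunk?: unknown) => void): void;
  pause(): void;
  resume(): void;
}
interface Sink {
  write(chunk: unknown): boolean; // false = internal buffer is full
  once(event: "drain", cb: () => void): void;
  end(): void;
}

// Forward src to res, pausing the source whenever the sink signals
// backpressure and resuming once it drains.
export function forwardWithBackpressure(src: Src, res: Sink): void {
  src.on("data", (chunk) => {
    if (!res.write(chunk)) {
      src.pause();
      res.once("drain", () => src.resume());
    }
  });
  src.on("end", () => res.end());
}
```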


7. UX patterns for real-time AI

7.1 Pattern overview

Pattern                | When to use it
Typing indicator       | Simple chat, "AI is typing..." animation
Skeleton loading       | Structured responses (cards, tables, lists)
Progressive disclosure | Complex answers — gist first, details later
Stop generating        | Always — user control
Suggested actions      | After the answer — quick replies tied to MCP tools

8. Voice AI: the next frontier

8.1 Voice streaming is different

Text streaming masks the wait — the user watches the text appear. Voice is different: latency is felt directly.

Text chat                     | Voice AI
1–2s TTFB acceptable          | > 500ms TTFB feels "robotic"
User reads                    | User listens → silence = bad
Response is scrollable        | Audio is gone, not replayable inline
Markdown, tables, code → rich | Linear text only

8.2 The voice AI pipeline

User ──[audio]──▶ STT (Speech-to-Text)        ~200–500ms
                       │
                       ▼
                  LLM (text generation)        ~500–2000ms
                       │
                       ▼
                  TTS (Text-to-Speech)         ~200–500ms
                       │
User ◀──[audio]───────┘

Total latency: 900–3000ms

8.3 Latency reduction in voice AI

  1. Streaming TTS — generate audio sentence by sentence
  2. Filler sounds — "Hmm...", "So..." (a Google Duplex trick since 2018)
  3. Sentence boundary detection — detect ., !, ? and send to TTS immediately
  4. Speculative execution — the system pre-guesses the user's question

8.4 Protocols for voice AI

WebSocket is the default voice-AI protocol in 2026. WebRTC kicks in when P2P audio with minimal latency is required.


9. Industry direction: 2025–2026

9.1 Provider streaming API evolution

  • OpenAI Realtime API (Oct 2024): WebSocket-based, text + audio, full-duplex
  • Anthropic prompt caching: ~75% input token savings → faster prefill
  • Gemini 2.0 Flash: ~20ms/token throughput, among the fastest on the market
  • Mistral EU-hosted: GDPR-compliant streaming endpoints

9.2 Edge AI — streaming without a remote LLM

The most radical latency reduction: don't call a remote LLM. Small models (Gemma 2B, Phi-3 Mini, Llama 3 8B) can run on mobile devices (Apple Neural Engine, Qualcomm NPU), edge servers (Cloudflare Workers AI), or on-prem GPUs.

Aspect     | Cloud LLM             | Edge LLM
TTFT       | 200–1000ms            | 20–100ms
Tokens/s   | 20–80                 | 30–150 (hardware-dependent)
Cost       | Per-token API fee     | Fixed hardware + energy
Model size | Unlimited             | Max ~13B (mobile: ~3B)
Quality    | GPT-4o level          | Lower, but improving
Privacy    | Data goes to the cloud | Data stays local

2026 trend: hybrid approach — simple questions answered by a local (edge) model (~50ms TTFT), complex questions by a cloud LLM (~800ms TTFT). The routing decision is made by a small classifier model.
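The routing decision can be sketched with a cheap heuristic standing in for the small classifier model (the function, thresholds, and keyword list are all illustrative assumptions):

```typescript
// Hybrid router sketch: a heuristic stand-in for the small classifier
// that decides whether a query goes to the edge model or the cloud LLM.
export function routeQuery(query: string): "edge" | "cloud" {
  const needsCloud =
    query.length > 200 || // long queries are likely complex
    /\b(why|compare|analy[sz]e|explain)\b/i.test(query); // reasoning verbs
  return needsCloud ? "cloud" : "edge";
}
```

In production the heuristic would be replaced by an actual classifier (itself a small, fast model), but the interface stays the same: query in, route out.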

9.3 Speculative decoding — internal LLM acceleration

A small draft model (~1B) quickly proposes 5–10 tokens → the large target model (~70B) verifies them in a single forward pass, regenerating any it rejects. Result: ~2–3x speedup with no quality loss. Major providers are reported to use it in production.

9.4 Structured output streaming

OpenAI introduced streaming with response_format: { type: "json_schema" } in 2024 — the model emits tokens conforming to the JSON schema, and the client can parse partial JSON. Useful in an AI agent's evaluator–executor loop: the executor can start acting on early actions while later ones are still being generated.


10. Practical decision framework

10.1 The streaming decision tree

Is the AI response interactive (user is waiting)?
  │
  ├─ Yes → Text or voice?
  │           │
  │           ├─ Text → SSE (default)
  │           │           HTTP/2 + SSE if available
  │           │           WebSocket if you need bidirectionality
  │           │
  │           └─ Voice → WebSocket (binary)
  │                      WebRTC for ultra-low latency
  │
  └─ No (background task) → No streaming needed
                             Queue-based (BullMQ) + webhook

10.2 Monitoring checklist

Metric            | What it measures                | Target
TTFB (P50)        | Median time to first byte       | < 1s
TTFB (P95)        | 95th percentile                 | < 3s
Token rate        | Tokens per second on the stream | > 15 t/s
Stream error rate | Ratio of aborted streams        | < 1%
Reconnect rate    | Reconnection ratio              | < 5%
Client render lag | Client-side rendering delay     | < 50ms/token

11. Summary — the 7 most important takeaways

  1. SSE is the default: For AI chat streaming, SSE is simple, reliable, and compatible.
  2. TTFB > total response time: Time-to-first-token matters more than total response time.
  3. Multi-phase streaming: Communicate the tool calling + RAG + answer phases to the client.
  4. Latency budget: Define, measure, optimize. Biggest wins: parallel RAG/history and prompt size.
  5. Production edge cases: Heartbeat, reconnect + token buffer, backpressure, error handling.
  6. Voice AI is a different category: < 500ms TTFB, WebSocket, sentence-level TTS, filler sounds.
  7. Edge AI + speculative decoding: A hybrid future — easy tasks locally, complex tasks in the cloud.

The single rule

The best streaming implementation is the one users don't notice — they only feel that the AI answers "instantly."


Designing real-time AI chat or a voice agent?

The Atlosz team helps you design and implement streaming architecture, SSE/WebSocket integration, latency optimization, and production edge cases (reconnect, backpressure, error handling) — tailored to your system.

Let's talk about your AI strategy →