Streaming · SSE · Real-time AI · TTFB · AI UX · Latency · Tool calling · Voice AI

How Should AI Reply "Instantly"? — Streaming and Real-time AI in Practice

Ádám Zsolt & AIMY
9 min read

The 8-second silence that kills your AI product

Picture this: your customer clicks the chat icon on your website, types a question — and the screen stays empty. 1 second. 3 seconds. 5 seconds. The customer is already suspicious. 8 seconds. They close the tab.

And yet the AI was working in the background. It queried the CRM, searched the knowledge base, generated a detailed answer. It just didn't tell the user about any of it.

That's the difference between a demo-grade AI integration and a production AI product. And the solution is not to speed up the LLM — that's a physical limit. The solution is streaming.

According to Google's research, 53% of users abandon a page that takes more than 3 seconds to load. At the same time, a modern LLM (GPT-4o, Claude, Gemini) answering a complex question — especially with tool calling and RAG context — can take 5–15 seconds. Those two numbers don't fit together. Streaming architecture bridges that gap.

This article walks through what real-time AI means in practice, why SSE is the industry default, how to make the 8-second wait invisible, and which production traps await you when the system goes live.


The psychology of perception: why speed isn't what counts

The human brain doesn't measure response time with a stopwatch. It measures waiting. And waiting feels different when you're staring at an empty screen versus when you can see "something happening."

The classic UX scale:

  • 0–100ms: instant
  • 100ms–1s: fast, but noticeable
  • 1–3s: "thinking" — still acceptable with feedback
  • 3–10s: slow, but tolerable if you see something
  • 10s+: unacceptable

Streaming doesn't reduce the actual response time — an 8-second LLM response is still 8 seconds. What it reduces is Time to First Byte (TTFB): the time until the user sees the first character.

Without streaming, TTFB equals total response time (8s). With streaming, TTFB is ~50–500ms. That's the difference between "instant" and "unacceptable."

That's why ChatGPT types character by character, why Claude streams continuously, why Gemini shows you something immediately. Not because it's flashy — but because without it, the product would be unusable.


SSE, WebSocket, HTTP/2 — which one for AI?

When streaming comes up, developers often jump to WebSocket "because that's how it's done." For AI chat streaming, SSE (Server-Sent Events) is almost always the better choice.

Why SSE?

SSE is a simple HTTP-based stream: the server pushes data over an open connection, the client consumes it through an EventSource. Some advantages:

  • HTTP-based: passes through every proxy, CDN, load balancer, and corporate firewall
  • Built-in reconnection: the browser auto-reconnects after a drop
  • Simple implementation: server-side res.write(), client-side new EventSource(url)
  • Every major LLM provider uses SSE: OpenAI, Anthropic, Google, Mistral

When do you actually need WebSocket?

Two cases:

  1. Voice AI — audio data is binary, SSE can only carry text
  2. Bidirectional streams — when the client wants to send data mid-stream to the server (e.g., a "stop generating" button). With SSE you have to use a separate HTTP request

HTTP/2 and HTTP/3 (QUIC) also improve streaming — multiplexed streams over one TCP connection, 0-RTT setup, better mobile performance. But these improve the HTTP infrastructure layer; the application protocol is still SSE.

In one sentence: text AI chat → SSE. Voice AI → WebSocket. Background task → no streaming needed; queue-based processing (BullMQ) + webhook.


Multi-phase streaming: it's not just text

A modern AI agent's response is not a single continuous text stream. The typical flow:

  1. Thinking (200–500ms): the LLM decides which tools to call
  2. Tool call (100–2000ms): CRM query, RAG search, external API
  3. Result processing (~50ms): the tool response goes back to the LLM
  4. Possibly another thinking + tool call
  5. Final answer generation (2000–8000ms): this is when token streaming starts

If the user sees nothing during phases 1–4, the experience is just as bad as no streaming at all. The fix: send two kinds of events on the stream.

Phase 1 (status events) — text feedback about what the system is doing:

Client ← {"type": "status", "message": "Looking up customer info..."}
Client ← {"type": "status", "message": "Querying CRM..."}
Client ← {"type": "status", "message": "3 hits in the knowledge base..."}

Phase 2 (token events) — the actual response, character by character:

Client ← {"type": "token", "content": "Anna"}
Client ← {"type": "token", "content": " Smith"}
Client ← {"type": "token", "content": "'s last"}
Client ← {"type": "done", "usage": {...}}

The user gets immediate feedback that the system is working — even if the first real character only arrives 3 seconds later.

This is the architectural difference between, say, ChatGPT and a "legacy" chatbot. The former says something in every phase; the latter only speaks when it's done.


The latency budget: where do the milliseconds go?

If you target 1500ms TTFB (which is acceptable for text AI chat), it pays to know where the 1500ms goes. A realistic breakdown:

  • Client → server network: < 50ms (EU region, CDN)
  • Auth + parsing: < 10ms (cached JWT, minimal middleware)
  • RAG + history load: < 200ms (in parallel, pgvector index)
  • LLM TTFT (Time to First Token): < 1000ms (streaming, optimized prompt)
  • Server → client: < 50ms (SSE keep-alive, no buffering)
  • Total: < 1500ms

The two biggest optimization wins:

1. Parallel RAG and history: don't call the vector DB, the chat history, and the CRM sequentially — they're independent, run them in parallel. If each takes 200ms, sequentially that's 600ms; in parallel, 200ms.

2. Prompt size reduction: the prefill phase (when the LLM reads the prompt) scales linearly with prompt length. An 8000-token prompt gives ~600ms TTFT, a 2000-token one ~150ms. Useful techniques: context compression, only relevant RAG hits, system prompt cache (Anthropic ~75% input token savings).

Other good practices: connection pooling to the LLM provider (don't open a new TCP connection per call), edge computing (EU region for European users), cold-start elimination (dedicated server or warm container), LLM provider failover (adapter pattern: OpenAI → Claude → Gemini).


Production traps a demo never shows you

Streaming works on a localhost demo. In production, edge cases await.

1. Load balancer timeout: SSE is a long-lived HTTP connection. If the LLM "thinks" for 30 seconds (complex tool calling iterations), the load balancer (Nginx, HAProxy, AWS ALB) may drop the connection on idle timeout. Fix: periodic heartbeat SSE comment (: prefix), or raise the proxy timeout for streaming endpoints.

2. Reconnect handling: the client's network can drop (mobile handover, tunnel, elevator). SSE EventSource auto-reconnects — but the LLM provider stream is not replayable. Tokens generated so far must be buffered (in memory or Redis) and resent based on Last-Event-ID on reconnect. Without that, the user has to start the entire response over.

3. Mid-stream errors: what happens when the LLM API throws a 500 in the middle of the stream? The user got half a response. Best practice: emit an error event on the stream, persist the partial response, show a client-side "Regenerate" button. Don't let the connection just "vanish" — communicate the error.

4. Backpressure: if the LLM generates faster than the client can consume (bad network, weak phone), tokens pile up in the TCP buffer. Node.js pipe() handles backpressure natively — but test it explicitly on edge cases (slow 3G simulation).

5. Markdown streaming problem: LLMs answer in Markdown. A half-arrived code block or table breaks rendering. Fix: an incremental Markdown parser, or ChatGPT-style dual rendering (plain text during stream, Markdown re-render at the end).

These aren't theoretical problems — every production AI system hits them in the first weeks. Better to design for them up front than to fight them as bug tickets later.


The next frontier: voice AI

If streaming is a luxury in text chat, in voice AI it's mandatory. With audio, latency is felt directly — a one-second silence after you finish speaking feels "robotic."

The voice AI pipeline has three phases: STT (Speech-to-Text, ~200–500ms) → LLM (~500–2000ms) → TTS (Text-to-Speech, ~200–500ms). Total: 900–3000ms, which at the upper end is several times the sub-second latency that feels natural in conversation.

The key tricks:

  • Streaming TTS: don't wait for the full text. Generate audio sentence by sentence — the first sentence speaks within 500ms
  • Sentence boundary detection: detect ., !, ? in the LLM stream and ship to TTS immediately
  • Filler sounds: "Hmm...", "Just a moment..." (Google Duplex has used this since 2018) — natural-sounding cues that hide latency
  • WebSocket protocol: audio is binary, SSE doesn't fit. WebRTC enters the picture when P2P latency matters

OpenAI's Realtime API (Oct 2024) and Gemini 2.0's multimodal streaming already provide full-duplex voice + text — these will be the headline UX innovations of the next 2 years.


What to take home from this article?

Five practical points you can start tomorrow morning:

  1. Measure TTFB, not only total response time. From a perception standpoint, time to first character matters more than total length.

  2. Use SSE for text AI chat. Don't jump to WebSocket unless you really need it (voice, bidirectional).

  3. Stream status messages during tool calling. "Looking up...", "Querying..." — the user must not stare at a blank screen.

  4. Define a latency budget (e.g., < 1500ms TTFB), and optimize for the biggest items: parallel RAG/history, prompt size reduction.

  5. Design for production edge cases: heartbeat SSE comment, reconnect token buffer, error events on the stream. Don't leave six weeks of firefighting between demo and production.

The best streaming implementation is the one users don't notice — they only feel that the AI answers "instantly." Even though it's working for 8 seconds in the background.


Designing real-time AI chat or a voice agent?

The Atlosz team helps you design and implement streaming architecture — SSE/WebSocket, latency optimization, multi-phase UX, production edge cases. The full architecture is in our companion whitepaper.

Let's talk about your AI strategy →