Streaming Responses
Using streaming to make a snappier user experience

Until now, every revision in Inkwell has worked the same way on the wire. The client sends a request. The server calls Claude. Claude takes a few seconds. The response comes back as one JSON blob. The user sees the result appear, all at once, after the wait.
That last part is the bit worth questioning. Claude isn't generating the response in one operation; it's producing tokens one at a time, in order, over those few seconds. The response is already sequential. Buffering it on the server and delivering it as a single payload is a choice, and not always the right one.
This article switches the revision endpoint to streaming. The model keeps generating tokens at the same rate; we just stop holding them back.
Why bother
A streaming UI feels qualitatively different from a blocking one, even when the total time to the last character is identical.
Two reasons:
The first token arrives much faster than the last. Time-to-first-token (TTFT) for a Haiku response is usually a few hundred milliseconds. Total time to a 500-token response is several seconds. With a blocking call, the user stares at a spinner for the full duration. With streaming, they see prose materialising within half a second and can start reading before the model is done.
It's a signal of liveness. A spinner only tells you "something is happening." A typewriter cursor with text scrolling past it tells you exactly what is happening, in real time. If the model is going off the rails, you notice three sentences in instead of after the full response. You can cancel. You can adjust your prompt next time.
Both effects compound when responses get longer. For a writing assistant, where revisions can run several hundred tokens, the difference between "5 seconds of nothing" and "text appearing immediately" is the difference between feeling slow and feeling responsive, at the same actual latency.
How streaming works at the API level
The Anthropic API supports streaming via Server-Sent Events. You enable it by calling a different SDK method:
stream := ai.Messages.NewStreaming(ctx, params)
That's the only request-side change. The parameters are identical to a non-streaming call. What's different is the return type: *ssestream.Stream instead of *Message.
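For orientation, a complete call might look like this under the current v1 surface of the official Go SDK (github.com/anthropics/anthropic-sdk-go); the client construction, model constant, and prompt here are illustrative placeholders, not Inkwell's actual values:

// Illustrative streaming call; model and prompt are placeholders.
client := anthropic.NewClient() // reads ANTHROPIC_API_KEY from the environment
stream := client.Messages.NewStreaming(ctx, anthropic.MessageNewParams{
	Model:     anthropic.ModelClaude3_5HaikuLatest,
	MaxTokens: 1024,
	Messages: []anthropic.MessageParam{
		anthropic.NewUserMessage(anthropic.NewTextBlock("Revise this paragraph.")),
	},
})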
The stream surfaces events as they arrive. Each event is one of:
message_start: the response envelope without content; carries the model ID, message ID, and usage so far.
content_block_start: a new content block is about to begin (text, tool_use, etc.).
content_block_delta: incremental content for the current block. For text, this is a text_delta carrying a chunk of generated text.
content_block_stop: the current content block is finished.
message_delta: top-level updates, including the final stop_reason and the final output token count.
message_stop: the response is done.
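If you want to branch on these yourself rather than only accumulate them, the union's AsAny method supports a type switch. A sketch, with the variant type names following the SDK's generated union (here deltas just go to stdout instead of being forwarded to a client):

// Sketch of dispatching on stream events via AsAny.
for stream.Next() {
	switch ev := stream.Current().AsAny().(type) {
	case anthropic.MessageStartEvent:
		// envelope only: model ID, message ID, usage so far
	case anthropic.ContentBlockDeltaEvent:
		if td, ok := ev.Delta.AsAny().(anthropic.TextDelta); ok {
			fmt.Print(td.Text) // one chunk of generated text
		}
	case anthropic.MessageDeltaEvent:
		// final stop_reason and output token count
	case anthropic.MessageStopEvent:
		// the response is done
	}
}
// Then check stream.Err(), as in the loop below.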
You iterate the stream with Next() / Current(), and the SDK ships a helper that reassembles the events into a normal Message for you:
accumulator := anthropic.Message{}
for stream.Next() {
	event := stream.Current()
	if err := accumulator.Accumulate(event); err != nil {
		return err
	}
	// Forward deltas to your client here.
}
if err := stream.Err(); err != nil {
	return err
}
After the loop ends, accumulator is functionally equivalent to what a blocking call would have returned: same Content, same Usage, same StopReason. That matters for Inkwell because the persistence step doesn't change: we still save one revision row with one completion text. The wire just delivered it in pieces.
Getting it to the browser is the harder problem
Once you've got a token stream on the server, you need a way to deliver it to the client. The web platform offers exactly one purpose-built primitive for this: EventSource, the browser's native Server-Sent Events client.
There's a catch. EventSource only supports GET. It does not support a request body. And our revision request has a body: the user's prompt, the selected mode, possibly several hundred bytes of input.
You have three real options:
1. Manual SSE over fetch + ReadableStream. Keep the POST, parse the SSE wire format yourself in JavaScript. About 30 lines of extra client code.
2. Switch the streaming endpoint to GET, encode the prompt in the URL. EventSource works directly, but URLs end up in logs, browser history, and Referer headers, which is not where you want prose-shaped user input to live.
3. Two-step protocol. The POST stages the operation server-side and returns a one-shot token. A separate GET, parameterised by that token, opens the stream. EventSource works, and the URL is opaque.
Inkwell takes the third path. Native EventSource gives you the parsing, auto-reconnect on transient blips, and dev-tools integration for free. The cost is one extra request per turn and a small in-memory ticket store.
The two-step protocol
The endpoint pair:
POST /api/drafts/{id}/revisions → {"ticket": "128-bit hex"}
GET /api/drafts/{id}/revisions/stream?ticket=$t → SSE
The POST validates that the draft exists, generates a ticket, stashes the prompt and mode in an in-memory map keyed by the ticket, and returns. No model call yet. The GET consumes the ticket (one-shot: looking it up removes it), reloads the draft and revision history fresh, and only then calls Claude.
The ticket store is small and intentionally non-durable:
type pendingRevision struct {
	DraftID int
	Prompt  string
	Mode    string
	Expires time.Time
}

type ticketStore struct {
	mu    sync.Mutex
	items map[string]pendingRevision
	ttl   time.Duration
}

func (s *ticketStore) issue(p pendingRevision) (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	id := hex.EncodeToString(b[:])
	p.Expires = time.Now().Add(s.ttl)
	s.mu.Lock()
	s.items[id] = p
	s.mu.Unlock()
	return id, nil
}

func (s *ticketStore) consume(id string) (pendingRevision, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.items[id]
	if !ok {
		return pendingRevision{}, false
	}
	delete(s.items, id) // one-shot: a second lookup always misses
	if time.Now().After(p.Expires) {
		return pendingRevision{}, false
	}
	return p, true
}
A background goroutine purges expired entries once a minute. Tickets live for five minutes: comfortably longer than any realistic gap between the POST and the EventSource opening, and short enough that abandoned tickets don't accumulate.
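The purge loop itself isn't shown above. A plausible shape, with the method name and context wiring as assumptions and the fields being the ones defined on ticketStore:

// Hypothetical purge loop; sweeps the map once a minute until cancelled.
func (s *ticketStore) purgeLoop(ctx context.Context) {
	t := time.NewTicker(time.Minute)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			now := time.Now()
			s.mu.Lock()
			for id, p := range s.items {
				if now.After(p.Expires) {
					delete(s.items, id)
				}
			}
			s.mu.Unlock()
		}
	}
}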
A few properties worth naming:
One-shot. Looking up a ticket removes it. A network-level retry can't replay the LLM call (and its bill).
Opaque. 128 bits of crypto-random hex. A guessed ticket has effectively zero chance of matching anything live, even with a billion attempts.
Cheap to lose. Restart the server and pending tickets vanish. The client sees a closed connection and retries. Nothing important was on disk.
A consumed-ticket lookup returns 410 Gone. That status is deliberate: it tells the browser "this URL was valid once and isn't anymore," which means EventSource won't auto-reconnect against it. We get the right behaviour for free.
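In handler terms, that consume-or-410 step might look like this; the tickets variable and the error text are illustrative, not Inkwell's literal code:

// Sketch of the top of the GET handler.
op, ok := tickets.consume(r.URL.Query().Get("ticket"))
if !ok {
	// 410 tells the browser this URL was valid once and is dead now,
	// so EventSource won't auto-reconnect against it.
	http.Error(w, "ticket expired or already used", http.StatusGone)
	return
}
// op.Prompt and op.Mode now drive the model call.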
The streaming half
Once the GET handler has consumed a ticket and confirmed the draft, it switches the response into SSE mode and drives the model stream:
w.Header().Set("Content-Type", "text/event-stream")
w.Header().Set("Cache-Control", "no-cache")
w.Header().Set("Connection", "keep-alive")
w.WriteHeader(http.StatusOK)
flusher.Flush()
accumulator := anthropic.Message{}
for stream.Next() {
	event := stream.Current()
	if err := accumulator.Accumulate(event); err != nil {
		writeSSEEvent(w, flusher, "failure", map[string]string{"error": err.Error()})
		return
	}
	if cb, ok := event.AsAny().(anthropic.ContentBlockDeltaEvent); ok {
		if td, ok := cb.Delta.AsAny().(anthropic.TextDelta); ok {
			writeSSEEvent(w, flusher, "delta", map[string]string{"text": td.Text})
		}
	}
}
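The writeSSEEvent helper referenced above isn't SDK code; it's a small local function. One plausible implementation, assuming the signature the handler uses:

// A plausible writeSSEEvent: marshal the payload, emit one named SSE frame,
// and flush so the bytes leave the server immediately.
func writeSSEEvent(w http.ResponseWriter, f http.Flusher, name string, payload any) {
	data, err := json.Marshal(payload)
	if err != nil {
		return // nothing sendable; the stream will end without this frame
	}
	fmt.Fprintf(w, "event: %s\ndata: %s\n\n", name, data)
	f.Flush()
}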
Three event kinds go on the wire:
event: delta
data: {"text":"# Caching"}

event: delta
data: {"text":" is a technique that…"}

event: done
data: {"revision_id":3,"mode":"engineer","turn":3}

event: failure
data: {"error":"…"}   (terminal, only on errors)
The Flusher is the part beginners often miss. Without it, Go's HTTP machinery buffers writes and may not send anything until the response is large enough or the handler returns. Streaming requires explicit flushing on every frame.
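Getting hold of the flusher is a type assertion on the ResponseWriter, done before any streaming starts. A minimal sketch:

// Acquire the flusher up front; not every ResponseWriter supports it
// (some middleware wrappers don't), so fail fast if it doesn't.
flusher, ok := w.(http.Flusher)
if !ok {
	http.Error(w, "streaming unsupported", http.StatusInternalServerError)
	return
}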
One naming subtlety: the application-level error event is failure, not error. EventSource reserves error for connection-level failures (network drop, server crash, 4xx/5xx on the GET). If the server emitted event: error, the client would receive it on the same listener as transport errors, with no clean way to distinguish them. Using a different name keeps the two channels separate.
What the client does
EventSource. That's it.
const { ticket } = await fetch(`/api/drafts/${id}/revisions`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ prompt, mode }),
}).then(r => r.json());

const es = new EventSource(`/api/drafts/${id}/revisions/stream?ticket=${ticket}`);

es.addEventListener('delta', (e) => {
  const { text } = JSON.parse(e.data);
  turn.completion += text;
});

es.addEventListener('done', (e) => {
  const data = JSON.parse(e.data);
  turn.streaming = false;
  turn.revisionId = data.revision_id;
  es.close();
});

es.addEventListener('failure', (e) => {
  es.close();
  showError(JSON.parse(e.data).error);
});

es.addEventListener('error', () => {
  // Connection-level error. Distinct from 'failure' above.
  es.close();
  showError('connection lost; please retry');
});
The Alpine component pushes a placeholder turn into its thread array right after the POST returns, then mutates that turn in place as delta events arrive. Because Alpine tracks turn.completion as reactive state, every append re-renders the text node. The user sees text materialising in the same <div> that was empty a moment ago. A CSS pseudo-element on .turn.streaming adds a blinking cursor at the end:
.turn.streaming .completion::after {
  content: "▋";
  animation: blink 1s steps(2, start) infinite;
}

@keyframes blink {
  to { visibility: hidden; }
}
When the done event arrives, we set streaming: false, the cursor disappears, and the turn is indistinguishable from one that was rendered synchronously.
The closing-on-done detail is worth highlighting. We call es.close() synchronously in the done handler, before the server's TCP close arrives at the browser. That sets readyState to CLOSED and prevents the browser from interpreting the subsequent connection close as a connection-level error and retrying.
What didn't change
Persistence. The revisions table still gets exactly one row per turn, inserted after the stream completes. The SDK's Message.Accumulate helper reassembles the deltas into the same Message struct a blocking call would have produced, so the persistence path is unchanged:
completion := accumulator.Content[0].Text
if _, err := orm.Exec(conn, domain.InsertRevision(&domain.Revision{
	DraftID: id, Prompt: op.Prompt, Completion: completion, Mode: mode,
})); err != nil {
	return err
}
This is the right shape. Streaming is a transport-layer concern. From the database's perspective, a turn is still atomic: it either succeeded and the full completion is on disk, or it failed and there's no row. The fact that the user already saw the text appear character by character is irrelevant to the storage contract.
The conversation reconstruction in buildMessages also doesn't change. When the next revision request arrives, it loads every prior revision and replays them as the conversation. Whether those revisions were streamed or not is invisible to the model โ it just sees the assembled message history.
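A sketch of what that replay can look like; the article only names buildMessages, so the signature is an assumption, while the Revision fields are the ones used in the insert above:

// Hedged sketch of buildMessages: replay each stored revision as a
// user/assistant pair, then append the new prompt.
func buildMessages(history []domain.Revision, prompt string) []anthropic.MessageParam {
	var msgs []anthropic.MessageParam
	for _, rev := range history {
		msgs = append(msgs,
			anthropic.NewUserMessage(anthropic.NewTextBlock(rev.Prompt)),
			anthropic.NewAssistantMessage(anthropic.NewTextBlock(rev.Completion)),
		)
	}
	return append(msgs, anthropic.NewUserMessage(anthropic.NewTextBlock(prompt)))
}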
What to notice when you run it
Open the network tab and submit a revision. You'll see two requests now: a small POST that returns a JSON {ticket} body in 50–100ms, and a GET that stays in "pending" state for the duration of the stream. Click on the GET, switch to the EventStream tab, and you'll see each frame arrive in real time with timestamps. The first delta typically lands within 300–500ms; subsequent deltas come in clusters.
The clustering is something worth understanding. The SDK doesn't deliver one token per delta โ Anthropic's wire format groups tokens into chunks for efficiency, and the chunks vary in size. Your UI doesn't care: each delta is just text to append. But if you log the deltas, you'll see anywhere from one to a few dozen per response, not several hundred.
What's not great about this
The extra staging request per turn might feel wasteful. In practice the POST is small and same-origin; it adds maybe 50ms of round trip and zero compute. For an interactive writing assistant where the model call itself takes 2-5 seconds, that's invisible.
Reconnect semantics on a one-shot LLM stream don't quite work. If the EventSource connection drops mid-stream, the browser would normally reconnect automatically, but the ticket is already consumed, so the retry hits a 410 Gone and the browser stops. The user has to retry from the UI. Real implementations handle this either by buffering the result server-side and replaying from Last-Event-ID, or by accepting the limitation. We accept it; this gets revisited in the failure-handling work of article 11.
Server state is no longer zero. The ticket store is in-memory, but it's state โ a thing that accumulates and needs cleanup. Any feature that introduces server state should pay for itself, and this one does, but it's worth noticing the moment it stops being a stateless app.



