Streaming Responses
Using streaming to make a snappier user experience

Until now, every revision in Inkwell has worked the same way on the wire. The client sends a request. The server calls Claude. Claude takes a few seconds. The response comes back as one JSON blob. The user sees the result appear, all at once, after the wait.
That last part is the bit worth questioning. Claude isn't generating the response in one operation; it's producing tokens one at a time, in order, over those few seconds. The response is already sequential. Buffering it on the server and delivering it as a single payload is a choice, and not always the right one.
This article switches the revision endpoint to streaming. The model keeps generating tokens at the same rate; we just stop holding them back.
Why bother
A streaming UI feels qualitatively different from a blocking one, even when the total time to the last character is identical.
Two reasons:
The first token arrives much faster than the last. Time-to-first-token (TTFT) for a Haiku response is usually a few hundred milliseconds. Total time to a 500-token response is several seconds. With a blocking call, the user stares at a spinner for the full duration. With streaming, they see prose materialising within half a second and can start reading before the model is done.
It's a signal of liveness. A spinner only tells you "something is happening." A typewriter cursor with text scrolling past it tells you exactly what is happening, in real time. If the model is going off the rails, you notice three sentences in instead of after the full response. You can cancel. You can adjust your prompt next time.
Both effects compound when responses get longer. For a writing assistant, where revisions can run several hundred tokens, the difference between "5 seconds of nothing" and "text appearing immediately" is the difference between feeling slow and feeling responsive, at the same actual latency.
How streaming works at the API level
The Anthropic API supports streaming via Server-Sent Events. You enable it by calling a different SDK method:
stream := ai.Messages.NewStreaming(ctx, params)
That's the only request-side change. The parameters are identical to a non-streaming call. What's different is the return type: *ssestream.Stream instead of *Message.
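For orientation, a complete call might look like this under the current v1 surface of the official Go SDK (github.com/anthropics/anthropic-sdk-go); the client construction, model constant, and prompt here are illustrative placeholders, not Inkwell's actual values:

// Illustrative streaming call; model and prompt are placeholders.
client := anthropic.NewClient() // reads ANTHROPIC_API_KEY from the environment
stream := client.Messages.NewStreaming(ctx, anthropic.MessageNewParams{
	Model:     anthropic.ModelClaude3_5HaikuLatest,
	MaxTokens: 1024,
	Messages: []anthropic.MessageParam{
		anthropic.NewUserMessage(anthropic.NewTextBlock("Revise this paragraph.")),
	},
})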
The stream surfaces events as they arrive. Each event is one of:
message_start: the response envelope without content; carries the model ID, message ID, and usage so far.
content_block_start: a new content block is about to begin (text, tool_use, etc.).
content_block_delta: incremental content for the current block. For text, this is a text_delta carrying a chunk of generated text.
content_block_stop: the current content block is finished.
message_delta: top-level updates, including the final stop_reason and the final output token count.
message_stop: the response is done.
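If you want to branch on these yourself rather than only accumulate them, the union's AsAny method supports a type switch. A sketch, with the variant type names following the SDK's generated union (here deltas just go to stdout instead of being forwarded to a client):

// Sketch of dispatching on stream events via AsAny.
for stream.Next() {
	switch ev := stream.Current().AsAny().(type) {
	case anthropic.MessageStartEvent:
		// envelope only: model ID, message ID, usage so far
	case anthropic.ContentBlockDeltaEvent:
		if td, ok := ev.Delta.AsAny().(anthropic.TextDelta); ok {
			fmt.Print(td.Text) // one chunk of generated text
		}
	case anthropic.MessageDeltaEvent:
		// final stop_reason and output token count
	case anthropic.MessageStopEvent:
		// the response is done
	}
}
// Then check stream.Err(), as in the loop below.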
You iterate the stream with Next() / Current(), and the SDK ships a helper that reassembles the events into a normal Message for you:
accumulator := anthropic.Message{}
for stream.Next() {
	event := stream.Current()
	if err := accumulator.Accumulate(event); err != nil {
		return err
	}
	// Forward deltas to your client here.
}
if err := stream.Err(); err != nil {
	return err
}
After the loop ends, accumulator is functionally equivalent to what a blocking call would have returned: same Content, same Usage, same StopReason. That matters for Inkwell because the persistence step doesn't change: we still save one revision row with one completion text. The wire just delivered it in pieces.
Getting it to the browser is the harder problem
Once you've got a token stream on the server, you need a way to deliver it to the client. The web platform offers exactly one purpose-built primitive for this: EventSource, the browser's native Server-Sent Events client.
There's a catch. EventSource only supports GET. It does not support a request body. And our revision request has a body: the user's prompt, the selected mode, possibly several hundred bytes of input.
You have three real options:
1. Manual SSE over fetch + ReadableStream. Keep the POST, parse the SSE wire format yourself in JavaScript. About 30 lines of extra client code.
2. Switch the streaming endpoint to GET, encode the prompt in the URL. EventSource works directly, but URLs end up in logs, browser history, and Referer headers, which is not where you want prose-shaped user input to live.
3. Two-step protocol. The POST stages the operation server-side and returns a one-shot token. A separate GET, parameterised by that token, opens the stream. EventSource works, and the URL is opaque.
Inkwell takes the third path. Native EventSource gives you the parsing, auto-reconnect on transient blips, and dev-tools integration for free. The cost is one extra request per turn and a small in-memory ticket store.
The two-step protocol
The endpoint pair:
POST /api/drafts/{id}/revisions → {"ticket": "128-bit hex"}
GET /api/drafts/{id}/revisions/stream?ticket=$t → SSE
The POST validates that the draft exists, generates a ticket, stashes the prompt and mode in an in-memory map keyed by the ticket, and returns. No model call yet. The GET consumes the ticket (one-shot: looking it up removes it), reloads the draft and revision history fresh, and only then calls Claude.
The ticket store is small and intentionally non-durable:
type pendingRevision struct {
	DraftID int
	Prompt  string
	Mode    string
	Expires time.Time
}

type ticketStore struct {
	mu    sync.Mutex
	items map[string]pendingRevision
	ttl   time.Duration
}

func (s *ticketStore) issue(p pendingRevision) (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	id := hex.EncodeToString(b[:])
	p.Expires = time.Now().Add(s.ttl)
	s.mu.Lock()
	s.items[id] = p
	s.mu.Unlock()
	return id, nil
}

func (s *ticketStore) consume(id string) (pendingRevision, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.items[id]
	if !ok {
		return pendingRevision{}, false
	}
	delete(s.items, id) // one-shot: a second lookup always misses
	if time.Now().After(p.Expires) {
		return pendingRevision{}, false
	}
	return p, true
}
A background goroutine purges expired entries once a minute. Tickets live for five minutes: comfortably longer than any realistic gap between the POST and the EventSource opening, and short enough that abandoned tickets don't accumulate.
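The purge loop itself isn't shown above. A plausible shape, with the method name and context wiring as assumptions and the fields being the ones defined on ticketStore:

// Hypothetical purge loop; sweeps the map once a minute until cancelled.
func (s *ticketStore) purgeLoop(ctx context.Context) {
	t := time.NewTicker(time.Minute)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			now := time.Now()
			s.mu.Lock()
			for id, p := range s.items {
				if now.After(p.Expires) {
					delete(s.items, id)
				}
			}
			s.mu.Unlock()
		}
	}
}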
A few properties worth naming:
One-shot. Looking up a ticket removes it. A network-level retry can't replay the LLM call (and its bill).
Opaque. 128 bits of crypto-random hex. A guessed ticket has effectively zero chance of matching anything live, even with a billion attempts.
Cheap to lose. Restart the server and pending tickets vanish. The client sees a closed connection and retries. Nothing important was on disk.
A consumed-ticket lookup returns 410 Gone. That status is deliberate: it tells the browser "this URL was valid once and isn't anymore," which means EventSource won't auto-reconnect against it. We get the right behaviour for free.
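In handler terms, that consume-or-410 step might look like this; the tickets variable and the error text are illustrative, not Inkwell's literal code:

// Sketch of the top of the GET handler.
op, ok := tickets.consume(r.URL.Query().Get("ticket"))
if !ok {
	// 410 tells the browser this URL was valid once and is dead now,
	// so EventSource won't auto-reconnect against it.
	http.Error(w, "ticket expired or already used", http.StatusGone)
	return
}
// op.Prompt and op.Mode now drive the model call.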
The streaming half
Once the GET handler has consumed a ticket and confirmed the draft, it switches the response into SSE mode and drives the model stream:
w.Header().Set("Content-Type", "text/event-stream")
w.Header().Set("Cache-Control", "no-cache")
w.Header().Set("Connection", "keep-alive")
w.WriteHeader(http.StatusOK)
flusher.Flush()
accumulator := anthropic.Message{}
for stream.Next() {
	event := stream.Current()
	if err := accumulator.Accumulate(event); err != nil {
		writeSSEEvent(w, flusher, "failure", map[string]string{"error": err.Error()})
		return
	}
	if cb, ok := event.AsAny().(anthropic.ContentBlockDeltaEvent); ok {
		if td, ok := cb.Delta.AsAny().(anthropic.TextDelta); ok {
			writeSSEEvent(w, flusher, "delta", map[string]string{"text": td.Text})
		}
	}
}
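The writeSSEEvent helper referenced above isn't SDK code; it's a small local function. One plausible implementation, assuming the signature the handler uses:

// A plausible writeSSEEvent: marshal the payload, emit one named SSE frame,
// and flush so the bytes leave the server immediately.
func writeSSEEvent(w http.ResponseWriter, f http.Flusher, name string, payload any) {
	data, err := json.Marshal(payload)
	if err != nil {
		return // nothing sendable; the stream will end without this frame
	}
	fmt.Fprintf(w, "event: %s\ndata: %s\n\n", name, data)
	f.Flush()
}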
Three event kinds go on the wire:
event: delta
data: {"text":"# Caching"}

event: delta
data: {"text":" is a technique that…"}

event: done
data: {"revision_id":3,"mode":"engineer","turn":3}

event: failure
data: {"error":"…"}   (terminal, only on errors)
The Flusher is the part beginners often miss. Without it, Go's HTTP machinery buffers writes and may not send anything until the response is large enough or the handler returns. Streaming requires explicit flushing on every frame.
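Getting hold of the flusher is a type assertion on the ResponseWriter, done before any streaming starts. A minimal sketch:

// Acquire the flusher up front; not every ResponseWriter supports it
// (some middleware wrappers don't), so fail fast if it doesn't.
flusher, ok := w.(http.Flusher)
if !ok {
	http.Error(w, "streaming unsupported", http.StatusInternalServerError)
	return
}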
One naming subtlety: the application-level error event is failure, not error. EventSource reserves error for connection-level failures (network drop, server crash, 4xx/5xx on the GET). If the server emitted event: error, the client would receive it on the same listener as transport errors, with no clean way to distinguish them. Using a different name keeps the two channels separate.
What the client does
EventSource. That's it.
const { ticket } = await fetch(`/api/drafts/${id}/revisions`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ prompt, mode }),
}).then(r => r.json());

const es = new EventSource(`/api/drafts/${id}/revisions/stream?ticket=${ticket}`);

es.addEventListener('delta', (e) => {
  const { text } = JSON.parse(e.data);
  turn.completion += text;
});

es.addEventListener('done', (e) => {
  const data = JSON.parse(e.data);
  turn.streaming = false;
  turn.revisionId = data.revision_id;
  es.close();
});

es.addEventListener('failure', (e) => {
  es.close();
  showError(JSON.parse(e.data).error);
});

es.addEventListener('error', () => {
  // Connection-level error. Distinct from 'failure' above.
  es.close();
  showError('connection lost; please retry');
});
The Alpine component pushes a placeholder turn into its thread array right after the POST returns, then mutates that turn in place as delta events arrive. Because Alpine tracks turn.completion as reactive state, every append re-renders the text node. The user sees text materialising in the same <div> that was empty a moment ago. A CSS pseudo-element on .turn.streaming adds a blinking cursor at the end:
.turn.streaming .completion::after {
  content: "▋";
  animation: blink 1s steps(2, start) infinite;
}

@keyframes blink {
  to { visibility: hidden; }
}
When the done event arrives, we set streaming: false, the cursor disappears, and the turn is indistinguishable from one that was rendered synchronously.
The closing-on-done detail is worth highlighting. We call es.close() synchronously in the done handler, before the server's TCP close arrives at the browser. That sets readyState to CLOSED and prevents the browser from interpreting the subsequent connection close as a connection-level error and retrying.
What didn't change
Persistence. The revisions table still gets exactly one row per turn, inserted after the stream completes. The SDK's Message.Accumulate helper reassembles the deltas into the same Message struct a blocking call would have produced, so the persistence path is unchanged:
completion := accumulator.Content[0].Text
if _, err := orm.Exec(conn, domain.InsertRevision(&domain.Revision{
	DraftID: id, Prompt: op.Prompt, Completion: completion, Mode: mode,
})); err != nil {
	return err
}
This is the right shape. Streaming is a transport-layer concern. From the database's perspective, a turn is still atomic: it either succeeded and the full completion is on disk, or it failed and there's no row. The fact that the user already saw the text appear character by character is irrelevant to the storage contract.
The conversation reconstruction in buildMessages also doesn't change. When the next revision request arrives, it loads every prior revision and replays them as the conversation. Whether those revisions were streamed or not is invisible to the model โ it just sees the assembled message history.
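A sketch of what that replay can look like; the article only names buildMessages, so the signature is an assumption, while the Revision fields are the ones used in the insert above:

// Hedged sketch of buildMessages: replay each stored revision as a
// user/assistant pair, then append the new prompt.
func buildMessages(history []domain.Revision, prompt string) []anthropic.MessageParam {
	var msgs []anthropic.MessageParam
	for _, rev := range history {
		msgs = append(msgs,
			anthropic.NewUserMessage(anthropic.NewTextBlock(rev.Prompt)),
			anthropic.NewAssistantMessage(anthropic.NewTextBlock(rev.Completion)),
		)
	}
	return append(msgs, anthropic.NewUserMessage(anthropic.NewTextBlock(prompt)))
}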
What to notice when you run it
Open the network tab and submit a revision. You'll see two requests now: a small POST that returns a JSON {ticket} body in 50–100ms, and a GET that stays in "pending" state for the duration of the stream. Click on the GET, switch to the EventStream tab, and you'll see each frame arrive in real time with timestamps. The first delta typically lands within 300–500ms; subsequent deltas come in clusters.
The clustering is something worth understanding. The SDK doesn't deliver one token per delta โ Anthropic's wire format groups tokens into chunks for efficiency, and the chunks vary in size. Your UI doesn't care: each delta is just text to append. But if you log the deltas, you'll see anywhere from one to a few dozen per response, not several hundred.
What's not great about this
The extra staging request per turn might feel wasteful. In practice the POST is small and same-origin; it adds maybe 50ms of round trip and zero compute. For an interactive writing assistant where the model call itself takes 2-5 seconds, that's invisible.
Reconnect semantics on a one-shot LLM stream don't quite work. If the EventSource connection drops mid-stream, the browser would normally reconnect automatically, but the ticket is already consumed, so the retry hits a 410 Gone and the browser stops. The user has to retry from the UI. Real implementations handle this either by buffering the result server-side and replaying from Last-Event-ID, or by accepting the limitation. We accept it; this gets revisited in the failure-handling work of article 11.
Server state is no longer zero. The ticket store is in-memory, but it's state โ a thing that accumulates and needs cleanup. Any feature that introduces server state should pay for itself, and this one does, but it's worth noticing the moment it stops being a stateless app.



