
Anatomy of a message

Digging into the core concept of a message in LLM APIs


Inkwell's first real feature is eight lines of code that calls the Anthropic Messages API.

msg, err := ai.Messages.New(ctx, anthropic.MessageNewParams{
    Model:     anthropic.ModelClaudeHaiku4_5,
    MaxTokens: 1024,
    System: []anthropic.TextBlockParam{
        {Text: systemPrompt},
    },
    Messages: []anthropic.MessageParam{
        anthropic.NewUserMessage(anthropic.NewTextBlock(req.Content + "\n\n" + req.Prompt)),
    },
})

It takes a draft and an instruction, hands them to Claude, and returns the revision. Simple on the surface — but almost every field in that call has a non-obvious implication, and the response that comes back has structure that matters a lot once you move past the happy path.

Let's take those eight lines apart 🚀

The request

model

The model string determines which version of Claude handles the request. claude-haiku-4-5 is the smallest and fastest in the current family — appropriate for a writing assistant where latency matters and the task (reword this paragraph) doesn't need the reasoning depth of Sonnet or Opus.

Model choice is one of those decisions that feels premature to optimise early but compounds quickly in production. A few things worth internalising now:

Models are not interchangeable. The same prompt can produce meaningfully different output quality and length across model tiers. If you build your UX around Haiku's typical response length and then switch to Opus, you may find responses that are twice as long — which breaks your streaming budget, your latency targets, and sometimes your layout.

The model string in the response may not match the one you sent. The API returns a model field in its response body. Anthropic occasionally routes requests to newer minor versions of a model (e.g. claude-haiku-4-5-20251001 might actually be served by a patched variant). Log the response model, not just the request model.

Pricing is per model, per token direction. Input tokens and output tokens are billed separately, at rates that vary by model. We'll look at usage in the response section — that's where you actually see the numbers.

max_tokens

This is a hard ceiling on how many tokens the model can generate. If the model hits it before finishing, generation stops — not at a sentence boundary, not at a paragraph break, just stops. The stop_reason in the response will be "max_tokens" rather than "end_turn", which is the signal you should watch for.

1024 is a reasonable default for a writing assistant. A full page of prose is roughly 700–800 tokens; this gives a little headroom without leaving the connection open for long responses.

What max_tokens is not: a way to control cost. Setting it to 100 doesn't cap the bill at 100 tokens if the model finishes naturally at 80. It's only a ceiling, not a target. Actual cost comes from what the model generates, not what you permit.

system

The system prompt is separate from the conversation. Conceptually it's an instruction to the model about who it is and how to behave — and the model treats it differently from user messages.

In Inkwell it's a single string:

You are a writing assistant. Help the user improve their draft.
Return only the revised text — no preamble, no explanations.

That last sentence matters. Without it, Claude often opens with "Here's a revised version of your text:" — which is helpful in a chat interface and annoying in an editor that renders the response directly. System prompts are where you establish output format contracts.

The API takes system as an array of content blocks ([]TextBlockParam) rather than a plain string. This is forward-looking: Claude supports mixed content in system prompts — text blocks, document blocks, cached blocks. For now we're using one text block.

messages

The conversation history — an array of alternating user and assistant turns. For this first version there's exactly one: the user message that combines the draft with the instruction.

anthropic.NewUserMessage(anthropic.NewTextBlock(req.Content + "\n\n" + req.Prompt))

The messages array is also how you replay conversations. To continue a revision session, you append the assistant's previous response and the user's next instruction before sending. The model has no memory between calls — the array is the memory. This is the central mechanic of multi-turn conversations.
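The append-and-resend mechanic can be sketched with plain structs. The `turn` type below is illustrative shorthand for the SDK's `MessageParam`, and `appendExchange` is a hypothetical helper — in the Go SDK you'd append the response (converted back to a message param) plus a fresh user message:

```go
package main

import "fmt"

// turn stands in for the SDK's MessageParam; nothing here is the real API.
type turn struct {
	Role    string // "user" or "assistant"
	Content string
}

// appendExchange records the assistant's reply and the user's next
// instruction, producing the history for the following request.
func appendExchange(history []turn, assistantReply, nextInstruction string) []turn {
	return append(history,
		turn{Role: "assistant", Content: assistantReply},
		turn{Role: "user", Content: nextInstruction},
	)
}

func main() {
	history := []turn{{Role: "user", Content: "…draft…\n\nMake it shorter."}}
	history = appendExchange(history, "Shorter draft.", "Now make it more formal.")
	// The array IS the memory: the next request carries every prior turn.
	for _, t := range history {
		fmt.Println(t.Role + ": " + t.Content)
	}
}
```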

The response

A successful call returns something like this:

{
  "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
  "type": "message",
  "role": "assistant",
  "model": "claude-haiku-4-5-20251001",
  "content": [
    {
      "type": "text",
      "text": "I would be happy to join you for coffee."
    }
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 52,
    "output_tokens": 11
  }
}

content

An array, not a string. That's the first thing that trips people up.

It's an array because Claude can return multiple content blocks — and because the same response envelope handles text, tool calls, and images. In practice, for a text-only completion like this, there will be exactly one block with "type": "text".
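Extracting the text by hand makes the shape concrete. This sketch parses the example response body with `encoding/json` (the `block` and `response` structs are illustrative; with the Go SDK you'd read the content slice directly):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// block mirrors one entry of the response's content array.
type block struct {
	Type string `json:"type"`
	Text string `json:"text"`
}

type response struct {
	Content []block `json:"content"`
}

// textOf concatenates every text block, skipping other block types.
func textOf(raw []byte) (string, error) {
	var r response
	if err := json.Unmarshal(raw, &r); err != nil {
		return "", err
	}
	var sb strings.Builder
	for _, b := range r.Content {
		if b.Type == "text" { // ignore tool_use, image, etc.
			sb.WriteString(b.Text)
		}
	}
	return sb.String(), nil
}

func main() {
	raw := []byte(`{"content":[{"type":"text","text":"I would be happy to join you for coffee."}]}`)
	text, err := textOf(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
}
```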

stop_reason

This tells you why generation stopped. Four possible values, and they're not equally likely:

  • "end_turn" — the model decided it was done. Normal completion.

  • "max_tokens" — the model hit your max_tokens ceiling before finishing. The response is truncated. If you're displaying it directly, the user sees an incomplete revision with no indication of why.

  • "stop_sequence" — generation hit one of the custom stop strings you specified in the request (we haven't used this yet, but it's how you delimit structured outputs).

  • "tool_use" — the model is calling a tool and wants you to execute it and continue the conversation. The full agentic loop is built on this.
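A defensive handler over those four values might look like this. The function name and the suggested actions are illustrative; only the stop_reason strings come from the API:

```go
package main

import "fmt"

// handleStop maps a stop_reason to what the caller should do next.
func handleStop(reason string) string {
	switch reason {
	case "end_turn":
		return "render the text as-is"
	case "max_tokens":
		return "warn: response truncated; consider retrying with a higher ceiling"
	case "stop_sequence":
		return "split on the stop string and parse the structured output"
	case "tool_use":
		return "execute the requested tool and continue the conversation"
	default:
		return "unknown stop_reason: " + reason
	}
}

func main() {
	fmt.Println(handleStop("max_tokens"))
}
```

The `default` arm matters: treating an unrecognized stop_reason as a normal completion is how silent truncation bugs ship.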

usage

"usage": {
  "input_tokens": 52,
  "output_tokens": 11
}

Every call costs tokens. input_tokens is how many tokens the model processed (system prompt + conversation history + the new user message). output_tokens is how many it generated. You're billed for both, at different rates.
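Turning usage into a cost estimate is one multiplication per direction. The per-million-token rates below are placeholders, not Anthropic's actual pricing — look up the real numbers for your model:

```go
package main

import "fmt"

// costUSD estimates the bill for one call from its token counts.
// inPerM and outPerM are dollars per million tokens; the values passed
// in main are made up for illustration.
func costUSD(inputTokens, outputTokens int64, inPerM, outPerM float64) float64 {
	return float64(inputTokens)/1e6*inPerM + float64(outputTokens)/1e6*outPerM
}

func main() {
	// The usage numbers from the example response above, with placeholder rates.
	fmt.Printf("$%.6f\n", costUSD(52, 11, 1.00, 5.00))
}
```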

Inkwell already logs output_tokens in the completion handler:

log.Info().Int("draft_id", rows[0].ID).Int("tokens_out", int(msg.Usage.OutputTokens)).Msg("completion saved")

Input tokens aren't logged yet — and they're the more interesting number once you introduce caching. When using prompt caching, usage gains two more fields: cache_creation_input_tokens (tokens written to cache on the first call) and cache_read_input_tokens (tokens read from cache on subsequent calls, billed at 10% of the normal input rate). The shape of usage is how you verify the cache is actually working.
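Verifying a cache hit is then a field check. This sketch decodes the extended usage shape — the JSON field names are the documented ones, but the struct and `cacheWorking` helper are illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usage mirrors the response's usage object, including the two fields
// that appear when prompt caching is active.
type usage struct {
	InputTokens              int64 `json:"input_tokens"`
	OutputTokens             int64 `json:"output_tokens"`
	CacheCreationInputTokens int64 `json:"cache_creation_input_tokens"`
	CacheReadInputTokens     int64 `json:"cache_read_input_tokens"`
}

// cacheWorking reports whether the call was served from the prompt
// cache: on a hit, cache_read_input_tokens is non-zero.
func cacheWorking(u usage) bool {
	return u.CacheReadInputTokens > 0
}

func main() {
	// Hypothetical usage from a second call against a cached prompt.
	raw := []byte(`{"input_tokens":12,"output_tokens":40,"cache_read_input_tokens":1800}`)
	var u usage
	if err := json.Unmarshal(raw, &u); err != nil {
		panic(err)
	}
	fmt.Println("cache hit:", cacheWorking(u))
}
```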

Next up

In the next article we add multi-turn conversations, which takes the messages array from a single element to a full revision history. The user says "make it shorter", the model responds, the user says "now make it more formal" — and the model has context for both instructions because the previous turns are in the array. We'll build the revision thread in Inkwell and use the revisions table.

Building with AI

Part 1 of 2

In this series, I take you behind the AI feature — exploring the API patterns, integration strategies, and production tradeoffs that power real AI-assisted products ⚡️. We build Inkwell, a writing intelligence platform, as our companion app throughout 🚀
