Token Counting

Large tool results consume significant context window space in the calling LLM. The proxy counts tokens in every tool response before it is returned to the MCP client, emitting per-tool metrics and optionally enforcing size limits.

How it works

After any jq response transform is applied, the proxy tokenizes the result using a tiktoken-compatible tokenizer. The token count is attached to the active OpenTelemetry span and emitted as a histogram metric. No result content is logged.
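
As a rough sketch of the counting step (not the proxy's actual code), the function below serializes a post-transform result and counts its tokens with a pluggable encoder. The names count_response_tokens and toy_encode are hypothetical, compact JSON serialization is an assumption, and toy_encode is a stand-in regex splitter rather than a real BPE tokenizer. With tiktoken installed you could pass tiktoken.get_encoding("o200k_base").encode instead.

```python
import json
import re
from typing import Any, Callable, List


def toy_encode(text: str) -> List[str]:
    """Stand-in tokenizer for illustration only: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def count_response_tokens(result: Any, encode: Callable[[str], list] = toy_encode) -> int:
    """Count tokens in a tool result after any response transform has run.

    Strings are counted as-is; other values are serialized to compact JSON
    first (an assumption about how the result is rendered for the client).
    """
    payload = result if isinstance(result, str) else json.dumps(result, separators=(",", ":"))
    return len(encode(payload))


if __name__ == "__main__":
    # Only the count would be recorded on the span and metric, never the payload.
    print(count_response_tokens({"items": [1, 2, 3]}))
```

Taking the encoder as a parameter keeps the counting logic independent of which tokenizer the model setting selects.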

Configuration

token_counting:
  enabled: true
  model: gpt-4o           # model name that selects the BPE encoding (see Supported tokenizers)
  max_tokens: 8192        # optional: reject results that exceed this limit

Per-tool limit override

Set a per-operation limit via the x-mcp-max-tokens extension in an overlay:

- target: "$.paths[\"/search\"].get"
  update:
    x-mcp-max-tokens: 2048    # tighter limit for a high-cardinality search endpoint
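
One way to picture how the override interacts with the global setting is the lookup below. It is a sketch under the assumption that a per-operation x-mcp-max-tokens value takes precedence over token_counting.max_tokens and that no limit applies when neither is set. The function name effective_max_tokens is hypothetical.

```python
from typing import Optional


def effective_max_tokens(global_limit: Optional[int], operation: dict) -> Optional[int]:
    """Return the token limit for one operation.

    `operation` is the OpenAPI operation object after overlays are applied.
    Assumes the per-operation extension wins over the global limit.
    """
    override = operation.get("x-mcp-max-tokens")
    return override if override is not None else global_limit


if __name__ == "__main__":
    search_op = {"operationId": "search", "x-mcp-max-tokens": 2048}
    other_op = {"operationId": "getItem"}
    print(effective_max_tokens(8192, search_op))  # per-operation override
    print(effective_max_tokens(8192, other_op))   # falls back to the global limit
```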

What happens when a result exceeds the limit

By default the proxy returns an MCP error instructing the client to narrow its query. Set truncate: true to truncate the result to the limit instead:

token_counting:
  enabled: true
  max_tokens: 8192
  truncate: true       # truncate instead of rejecting oversized results

Truncation cuts the post-transform JSON string at a token boundary, then appends a …[truncated] marker so the client knows the result is incomplete. Because the cut can fall anywhere in the string, the truncated output may no longer parse as JSON.
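
The reject-or-truncate behavior can be sketched as follows. This is an illustration, not the proxy's implementation: enforce_limit, ResultTooLarge, and the toy tokenizer are hypothetical names, the tokenizer is a reversible regex splitter standing in for a real BPE encoder, and the sketch assumes the marker is not counted against the limit (the behavior described above does not say either way).

```python
import re
from typing import List

MARKER = "…[truncated]"


class ResultTooLarge(Exception):
    """Stands in for the MCP error returned when a result is rejected."""


def toy_encode(text: str) -> List[str]:
    # Reversible stand-in tokenizer: whitespace runs, words, single symbols.
    return re.findall(r"\s+|\w+|[^\w\s]", text)


def toy_decode(tokens: List[str]) -> str:
    return "".join(tokens)


def enforce_limit(payload: str, max_tokens: int, truncate: bool = False) -> str:
    """Return the payload unchanged, truncated with a marker, or raise."""
    tokens = toy_encode(payload)
    if len(tokens) <= max_tokens:
        return payload
    if not truncate:
        raise ResultTooLarge(
            f"result has {len(tokens)} tokens, limit is {max_tokens}; narrow the query"
        )
    # Cut at a token boundary, then flag the result as incomplete.
    return toy_decode(tokens[:max_tokens]) + MARKER


if __name__ == "__main__":
    print(enforce_limit('{"a":1}', 3, truncate=True))
```

Cutting on the token list rather than on characters or bytes guarantees the kept prefix never exceeds the limit under the same tokenizer.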

Metrics

Token counts are emitted as an OpenTelemetry histogram named mcp.tool.response_tokens, with a tool label. This lets you see which tools consume the most context budget over time in your Prometheus/Grafana stack. See OpenTelemetry.

Supported tokenizers

Model value	Encoding
`gpt-4o`	o200k_base
`gpt-4`, `gpt-3.5-turbo`	cl100k_base
`claude-3`	approximated via cl100k_base
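
The table above amounts to a small lookup from the configured model value to an encoding name. A minimal sketch, with hypothetical names (TokenizerChoice, resolve_encoding) and the assumption that an unlisted model is treated as a configuration error:

```python
from typing import NamedTuple


class TokenizerChoice(NamedTuple):
    encoding: str      # tiktoken encoding name
    approximate: bool  # True when the count is only an estimate for that model


# Mirrors the supported-tokenizers table.
SUPPORTED = {
    "gpt-4o": TokenizerChoice("o200k_base", False),
    "gpt-4": TokenizerChoice("cl100k_base", False),
    "gpt-3.5-turbo": TokenizerChoice("cl100k_base", False),
    "claude-3": TokenizerChoice("cl100k_base", True),
}


def resolve_encoding(model: str) -> TokenizerChoice:
    """Map a token_counting.model value to its encoding.

    Raising on unknown models is an assumption made for this sketch.
    """
    try:
        return SUPPORTED[model]
    except KeyError:
        raise ValueError(f"unsupported tokenizer model: {model!r}") from None


if __name__ == "__main__":
    print(resolve_encoding("claude-3"))
```

Carrying the approximate flag alongside the encoding makes it easy to treat claude-3 counts as estimates, for example by leaving headroom when choosing max_tokens.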