Token Counting
Large tool results consume significant context window space in the calling LLM. The proxy counts tokens in every tool response before it is returned to the MCP client, emitting per-tool metrics and optionally enforcing size limits.
How it works
After any jq response transform is applied, the proxy tokenizes the result using a tiktoken-compatible tokenizer. The token count is attached to the active OpenTelemetry span and emitted as a histogram metric. No result content is logged.
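The counting step can be sketched as follows. This is a minimal illustration under stated assumptions, not the proxy's implementation: `toy_encode`, `count_tokens`, and `record_response_tokens` are hypothetical names, the regex splitter only stands in for a tiktoken-compatible BPE encoder (real counts will differ), and the in-memory dict stands in for the OpenTelemetry histogram.

```python
import json
import re
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical stand-in for a tiktoken-compatible encoder: splits text into
# word and punctuation chunks. Real BPE token counts will differ.
def toy_encode(text: str) -> List[str]:
    return re.findall(r"\w+|[^\w\s]", text)

# In-memory stand-in for the mcp.tool.response_tokens histogram, keyed by
# tool name. Only counts are stored, never result content.
response_tokens: Dict[str, List[int]] = defaultdict(list)

def count_tokens(result: object, encode: Callable[[str], List[str]] = toy_encode) -> int:
    """Serialize the post-transform result and count its tokens."""
    payload = json.dumps(result, separators=(",", ":"))
    return len(encode(payload))

def record_response_tokens(tool: str, result: object) -> int:
    """Count tokens for one tool response and record the count per tool."""
    n = count_tokens(result)
    response_tokens[tool].append(n)
    return n
```

Passing the encoder as a parameter keeps the sketch independent of any particular tokenizer library; a real encoder's `encode` function could be supplied in place of `toy_encode`.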
Configuration
token_counting:
  enabled: true
  model: gpt-4o     # tokenizer model (used for byte-pair encoding)
  max_tokens: 8192  # optional: reject results that exceed this limit
Per-tool limit override
Set a per-operation limit via the x-mcp-max-tokens extension in an overlay:
- target: "$.paths[\"/search\"].get"
  update:
    x-mcp-max-tokens: 2048  # tighter limit for a high-cardinality search endpoint
What happens when a result exceeds the limit
By default the proxy returns an MCP error instructing the client to narrow its query. Set truncate: true to truncate the result to the limit instead:
token_counting:
  enabled: true
  max_tokens: 8192
  truncate: true  # truncate instead of rejecting oversized results
Truncation operates on the post-transform JSON string at the token boundary, then appends a …[truncated] marker so the client knows the result is incomplete.
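The reject and truncate behaviors described above, together with the per-operation override, might look roughly like this. It is a sketch only: `effective_limit`, `enforce_limit`, `ResultTooLargeError`, and the whitespace tokenizer are hypothetical, and whether the marker itself counts toward the limit is not specified here, so this version simply appends it after cutting to the limit.

```python
from typing import Callable, List, Optional

TRUNCATION_MARKER = "…[truncated]"

class ResultTooLargeError(Exception):
    """Hypothetical stand-in for the MCP error returned to the client."""

def effective_limit(global_max: Optional[int], per_tool_max: Optional[int]) -> Optional[int]:
    """A per-operation x-mcp-max-tokens value overrides the global max_tokens."""
    return per_tool_max if per_tool_max is not None else global_max

def enforce_limit(
    payload: str,
    limit: Optional[int],
    truncate: bool,
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
) -> str:
    """Return the payload unchanged or truncated, or raise if it is too large."""
    if limit is None:
        return payload  # no limit configured
    tokens = encode(payload)
    if len(tokens) <= limit:
        return payload
    if not truncate:
        raise ResultTooLargeError(
            f"result is {len(tokens)} tokens, limit is {limit}; narrow the query"
        )
    # Cut at a token boundary, then append the marker.
    return decode(tokens[:limit]) + TRUNCATION_MARKER

# Toy whitespace tokenizer for illustration only; real BPE tokens differ.
def toy_encode(text: str) -> List[str]:
    return text.split(" ")

def toy_decode(tokens: List[str]) -> str:
    return " ".join(tokens)
```

For example, `enforce_limit("a b c d e", 3, True, toy_encode, toy_decode)` keeps the first three tokens and appends the marker, while the same call with `truncate=False` raises `ResultTooLargeError`.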
Metrics
Token counts are emitted as an OpenTelemetry histogram mcp.tool.response_tokens with a tool label. This lets you observe which tools are burning the most context budget over time in your Prometheus/Grafana stack. See OpenTelemetry.
Supported tokenizers
| Model value | Encoding |
|---|---|
| gpt-4o | o200k_base |
| gpt-4, gpt-3.5-turbo | cl100k_base |
| claude-3 | approximated via cl100k_base |
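The table can be read as a simple lookup from the configured model value to an encoding. The sketch below is illustrative: `ENCODINGS` and `resolve_encoding` are hypothetical names, and raising on an unknown model is an assumption, since this section does not say how the proxy handles unsupported values.

```python
from typing import Dict, Tuple

# Mapping taken from the table above: model value -> (encoding, is_exact).
ENCODINGS: Dict[str, Tuple[str, bool]] = {
    "gpt-4o": ("o200k_base", True),
    "gpt-4": ("cl100k_base", True),
    "gpt-3.5-turbo": ("cl100k_base", True),
    "claude-3": ("cl100k_base", False),  # approximation: counts are estimates
}

def resolve_encoding(model: str) -> Tuple[str, bool]:
    """Return the encoding name and whether counts are exact for that model."""
    if model not in ENCODINGS:
        # Assumed behavior; the actual proxy may fall back to a default instead.
        raise ValueError(f"unsupported tokenizer model: {model!r}")
    return ENCODINGS[model]
```

Carrying the `is_exact` flag alongside the encoding name makes it easy to leave headroom below `max_tokens` when the count is only approximate.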
See also
- Tool Result Caching — avoid re-fetching large results
- OpenTelemetry — full metrics reference
- OpenAPI Overlays — per-operation limits