June 18, 2026
I maintain an MCP server for FeatureOS, a fastmcp-slim (Python) server that exposes a product-feedback platform's API as 31 tools an assistant can call: list feature requests, post comments, manage changelogs, and so on. It worked. Tools returned the right data. Tests passed.
So instead of reviewing it myself, I pointed a swarm of agents at it (roughly 60, fanned out by a multi-agent workflow) and asked them to audit it the way a model, not a human, actually experiences it. That reframing is the whole story: the bugs that surfaced weren't crashes or wrong data. They were places where the server returned the correct answer and the model still couldn't use it.
The setup: a 60-agent self-audit
- Dependency bloat. Declared vs. actually-imported packages.
- Logging consistency. Does everything share one format on one stream?
- Tool affordances. Can a model select and call each tool correctly from its schema alone?
- Pagination reachability. Can the model get past page one?
- Widget design match. Do the UI widgets use the real product tokens?
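To make the first audit track concrete, here is a minimal sketch of a declared-vs-imported check. It is not the audit's actual tooling: the function names, the sample source, and the declared set are all hypothetical, and it assumes each distribution name matches its import name (which is not always true, for example a `-slim` distribution that still imports as the base package).

```python
import ast


def imported_top_level(source: str) -> set[str]:
    """Collect the top-level module names imported by one Python source string."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as "from . import x"
            names.add(node.module.split(".")[0])
    return names


def unused_declared(declared: set[str], sources: list[str]) -> set[str]:
    """Return declared packages that no source file imports."""
    used: set[str] = set()
    for src in sources:
        used |= imported_top_level(src)
    return declared - used


# Hypothetical inputs, for illustration only.
SAMPLE = "import json\nfrom fastmcp.tools import Tool\nimport httpx.auth\n"
print(unused_declared({"fastmcp", "httpx", "rich"}, [SAMPLE]))  # {'rich'}
```

A real check would also need a distribution-to-module mapping and would have to account for packages that are only used as plugins or at runtime.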
The headline bug: models read content, not structured_content
A tool result carries two things: a content block (the text the model reads) and structured_content (a machine-readable block for UIs and programmatic consumers). By default, the model only sees content.
Every list tool I had was doing this:
# The data is all here, but the model never sees the cursor
return ToolResult(
    content=f"Found {len(items)} posts.",  # all the model reads
    structured_content={"posts": items, "page": page, "has_more": True},  # invisible
)

The fix is one shared helper that writes the items and the next-page instruction into content:

import json

def build_list_result(items, label, key, *, page=None, per_page=None, meta=None):
    count = len(items)
    has_more = per_page is not None and count == per_page  # full page => likely more
    head = f"Found {count} {label}{'s' if count != 1 else ''}."
    if has_more and page is not None:
        head += (
            f"\n\n[Page {page}. More results available. Call this tool again "
            f"with page={page + 1} and the same other arguments to get the next page.]"
        )
    return ToolResult(
        content=head + "\n\n" + json.dumps(items, default=str),  # the model reads THIS
        structured_content={key: items, "page": page, "has_more": has_more},
        meta=meta,
    )

The model reads content, not structured_content.
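One way to keep this from regressing is a check that looks only at what the model sees. The sketch below is an illustration, not part of the server: `FakeResult` is a hypothetical stand-in for the real result type, and the check assumes the text cue contains the literal `page=N` argument for the next page.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    """Hypothetical stand-in for a tool result: text plus structured data."""
    content: str
    structured_content: dict = field(default_factory=dict)


def pagination_is_reachable(result: FakeResult) -> bool:
    """False when the structured block reports more pages but the text hides it."""
    sc = result.structured_content
    if not sc.get("has_more"):
        return True  # nothing more to reach
    page = sc.get("page")
    if page is None:
        return False  # more data exists, but there is no page number to advance
    return f"page={page + 1}" in result.content


before = FakeResult("Found 25 posts.", {"page": 1, "has_more": True})
after = FakeResult(
    "Found 25 posts.\n\n[Page 1. More results available. Call this tool again "
    "with page=2 and the same other arguments to get the next page.]",
    {"page": 1, "has_more": True},
)
print(pagination_is_reachable(before), pagination_is_reachable(after))  # False True
```

One consequence of the full-page heuristic is worth knowing: when the total happens to be an exact multiple of per_page, the last page is also full, so the hint sends the model to one extra, empty page. That costs a call but never hides data.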
is_error is not optional
# Before: a 403/404 body comes back as if it were a normal result
return ToolResult(content=result)

# After: the failure is actually flagged
return ToolResult(
    content=f"Error: {result.get('message')}",
    structured_content=result,
    is_error=True,
)

Without is_error=True, the SDK ships the failure as an ordinary result. A model reads {"error": true, ...} as data: it'll happily report "the upvoters are: error", or read an auth failure as "zero results" and move on. One flag is the difference between the model retrying and the model hallucinating over a 404.
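The same idea can live in one place instead of in every tool. This is a minimal sketch under stated assumptions: `FakeResult` and `to_result` are hypothetical names, and it assumes backend failures arrive as a dict with a truthy "error" key and an optional "message", as the `{"error": true, ...}` body above suggests.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeResult:
    """Hypothetical stand-in for the real tool result type."""
    content: str
    structured_content: Any = None
    is_error: bool = False


def to_result(body: dict) -> FakeResult:
    """Flag backend error bodies instead of passing them through as data."""
    if body.get("error"):
        # Assumption: the error body may lack a message, so fall back to fixed text
        # rather than rendering "Error: None".
        message = body.get("message") or "request failed (no message returned)"
        return FakeResult(content=f"Error: {message}", structured_content=body, is_error=True)
    return FakeResult(content=json.dumps(body, default=str), structured_content=body)


print(to_result({"error": True, "message": "Not found"}))
print(to_result({"upvoters": []}).is_error)  # False
```

The fallback message is a small design choice: a bare `result.get('message')` would put the string "None" in front of the model whenever the backend omits the field.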
While we were in the descriptions, the audit caught one that was actively wrong. list_team_members told the model to assign teammates by their member ID, but the assign endpoint keys on email. The model would follow that to the letter and fail every time.
Teaching the model the domain
The server instructions now carry three things: domain framing, an ID-provenance map, and the pagination rule:
IDs are opaque and must come from a prior tool call. Never invent them:
  bucket_id <- list_buckets
  tag_ids <- list_tags
  changelog_id <- list_changelogs / search_changelogs
  to assign a teammate, pass their email (list_team_members -> user.email), not an id.

READ = {"readOnlyHint": True, "destructiveHint": False, "openWorldHint": False}
WRITE = {"readOnlyHint": False, "destructiveHint": False, "openWorldHint": False}
DESTRUCTIVE = {"readOnlyHint": False, "destructiveHint": True, "openWorldHint": False}
# openWorldHint=False: these tools touch ONE workspace, not the open web.

openWorldHint=False is the easy one to get wrong; I did, at first. The question isn't whether the tool hits the network, it's whether it reaches into an unbounded, open world like web search. A workspace API is a closed domain, and hosts lean on that hint to decide what's safe to auto-approve.
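Annotations are easy to apply inconsistently across dozens of tools, so a small lint can help. The sketch below is an assumption-laden illustration: the `TOOLS` registry is hypothetical (including the `delete_post` name and which preset each tool gets), and it encodes one rule specific to a closed-domain server, namely that no tool should claim an open world.

```python
READ = {"readOnlyHint": True, "destructiveHint": False, "openWorldHint": False}
WRITE = {"readOnlyHint": False, "destructiveHint": False, "openWorldHint": False}
DESTRUCTIVE = {"readOnlyHint": False, "destructiveHint": True, "openWorldHint": False}
REQUIRED_HINTS = ("readOnlyHint", "destructiveHint", "openWorldHint")


def annotation_problems(tools: dict[str, dict]) -> list[str]:
    """Return one message per tool whose annotations are incomplete or contradictory."""
    problems = []
    for name, ann in sorted(tools.items()):
        missing = [hint for hint in REQUIRED_HINTS if hint not in ann]
        if missing:
            problems.append(f"{name}: missing {', '.join(missing)}")
            continue
        if ann["readOnlyHint"] and ann["destructiveHint"]:
            problems.append(f"{name}: read-only and destructive at once")
        if ann["openWorldHint"]:
            problems.append(f"{name}: openWorldHint=True on a closed-domain tool")
    return problems


# Hypothetical registry: tool name -> annotations dict.
TOOLS = {
    "list_tags": READ,
    "delete_post": DESTRUCTIVE,
    "list_buckets": {"readOnlyHint": True},            # incomplete
    "list_changelogs": {**READ, "openWorldHint": True},  # wrong for one workspace
}
for line in annotation_problems(TOOLS):
    print(line)
```

Requiring all three hints explicitly matters because, as far as I understand the MCP specification, an omitted openWorldHint or destructiveHint defaults to true, so leaving one out is not a neutral choice.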
When the robots introduce the bug
One of the agents gave remove_tag_from_changelog an output_schema that declares tag_id as an object, but the tool returns tag_id as an integer.
# What the agent generated. Looks reasonable, breaks every call:
output_schema = item_output_schema("tag_id")  # declares {"tag_id": {"type": "object"}}

# But the tool actually returns:
# {"success": true, "tag_id": 61915, "changelog_id": 14420}  <- tag_id is an int

# The MCP SDK validates structured_content against output_schema at runtime,
# so this throws "Output validation error" on every SUCCESS.

# Fix: a status-envelope schema with no colliding typed key
def status_output_schema():
    return {
        "type": "object",
        "properties": {"success": {"type": "boolean"}},
        "additionalProperties": True,  # permissive: documents shape, never rejects
    }

The rule for output_schema: keep it permissive. Its job is to document the result shape for the model, not to police your own backend. A too-strict schema turns a valid response into a runtime error the model has to puzzle over. I proved the fix by round-tripping all 31 tools through an in-memory client and asserting zero output-validation errors.
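The collision is easy to reproduce without a server. The sketch below is a deliberately tiny stand-in for real JSON Schema validation: `top_level_errors` is a hypothetical helper that only compares top-level property types, which is enough to show why the generated schema rejects a successful payload while the permissive envelope accepts it.

```python
JSON_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "boolean": bool,
    "integer": int,
    "number": (int, float),
    "null": type(None),
}


def top_level_errors(payload: dict, schema: dict) -> list[str]:
    """Compare top-level property types only; not a full JSON Schema validator."""
    errors = []
    for key, spec in schema.get("properties", {}).items():
        if key not in payload or "type" not in spec:
            continue
        value, expected = payload[key], spec["type"]
        ok = isinstance(value, JSON_TYPES[expected])
        if expected in ("integer", "number") and isinstance(value, bool):
            ok = False  # bool subclasses int in Python but is not a JSON number
        if not ok:
            errors.append(f"{key}: expected {expected}, got {type(value).__name__}")
    return errors


STRICT = {"type": "object", "properties": {"tag_id": {"type": "object"}}}
PERMISSIVE = {
    "type": "object",
    "properties": {"success": {"type": "boolean"}},
    "additionalProperties": True,
}
payload = {"success": True, "tag_id": 61915, "changelog_id": 14420}

print(top_level_errors(payload, STRICT))      # ['tag_id: expected object, got int']
print(top_level_errors(payload, PERMISSIVE))  # []
```

A real round-trip test would call every tool through a client and let the SDK's own validator do the checking; this only illustrates the failure mode.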
Shipping it: six atomic PRs
- Dependency pin. requirements.txt pinned the full fastmcp 2.x distribution while the code targets fastmcp-slim 3.x APIs, so a fresh build installed a different framework than the one I developed against.
- Logging. The app logger, fastmcp's Rich handler, and uvicorn were all writing to stderr in three incompatible formats. I unified them onto one stdout formatter (plus captureWarnings).
- Pagination reachability. The build_list_result story above.
- Error signaling and descriptions. is_error, plus the ID-vs-email fix.
- Instructions, annotations, output schemas, and field constraints.
- Widget accent. Aligned to the exact oklch primary token from the webapp, not a hand-picked approximation.
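For the logging PR, the general shape of "one formatter, one stream" with the standard library looks roughly like this. It is a sketch, not the code that shipped: the format string and the `configure_logging` name are hypothetical, and the logger names (in particular "fastmcp") are assumptions about what each library registers.

```python
import logging
import sys

# Assumed logger names; check what your installed versions actually register.
NOISY_LOGGERS = ("uvicorn", "uvicorn.error", "uvicorn.access", "fastmcp")


def configure_logging(level: int = logging.INFO) -> None:
    """Route every logger through a single stdout handler with one format."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    root = logging.getLogger()
    root.handlers[:] = [handler]  # replace whatever was attached before
    root.setLevel(level)
    for name in NOISY_LOGGERS:
        logger = logging.getLogger(name)
        logger.handlers.clear()  # drop library-specific handlers and formats
        logger.propagate = True  # let records bubble up to the root handler
    logging.captureWarnings(True)  # warnings.warn() now goes to "py.warnings"


configure_logging()
logging.getLogger("uvicorn.error").info("server ready")
```

Ordering matters in practice: a library that installs its own handlers at startup will undo this if it runs afterwards, so the call has to happen after that setup, or the library has to be told not to configure logging itself.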
Validating live, across all 31 tools
Takeaways for anyone building MCP servers
- The model reads content, not structured_content. Put anything it must act on (cursors, next steps, "there's more") in the text.
- Always set is_error=True on failures, or the model reads your 404 as data.
- Add annotations, and set openWorldHint=False for closed-domain tools.
- Keep output_schema permissive, and round-trip-test it. Strict schemas fail valid responses at runtime.
- Make tool descriptions match the API's real contract (email vs. ID). The model will follow a wrong one off a cliff.
- Fan-out scales codebase-wide edits, but it will introduce bugs. The adversarial verify phase isn't optional, and neither is the human in the loop.