Skip to content

De-flake conformance CI: solo re-verification, spawn-storm reduction, result artifacts#3043

Merged
maxisbey merged 4 commits into
mainfrom
conformance-ci-deflake
Jul 1, 2026
Merged

De-flake conformance CI: solo re-verification, spawn-storm reduction, result artifacts#3043
maxisbey merged 4 commits into
mainfrom
conformance-ci-deflake

Conversation

@maxisbey

@maxisbey maxisbey commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

De-flakes the conformance CI entirely on the python-sdk side: unexpected suite failures are re-verified solo before failing the job, the interpreter spawn storm that causes the timing flakes is reduced at the source, results are uploaded as artifacts whenever they matter, and the one server-side race in the everything-server is fixed.

Motivation and Context

sse-retry fails the client conformance job on ~4–5% of runs (including pushes to main, e.g. https://github.com/modelcontextprotocol/python-sdk/actions/runs/28443055485), always the same way: the harness runs all ~40 scenarios' client subprocesses concurrently on a 2-vCPU runner, and the resulting contention pushes the scenario's wall-clock reconnect measurement into its warning band. The client is spec-compliant — the overshoot is scheduler latency, not SDK behavior (idle the measured delay is ~503ms against a 700ms gate; with the suite pinned to 2 cores it reaches ~670ms+).

Three changes, all local to this repo:

  1. run-client.sh — solo re-verification, not blind retries. On suite failure, each scenario listed under "Unexpected failures (not in baseline)" is re-run alone on the then-quiet runner (once by default; CONFORMANCE_SOLO_ATTEMPTS overrides). Solo, a compliant client passes the timing check deterministically, while anything genuinely broken fails solo too — so this is flake detection, not flake hiding. Guard rails so it cannot green-wash anything else: stale-baseline errors are never retried; an empty or unparseable failure list passes the original exit code through; extracted names are validated before use; a rescued run writes a FLAKE_RESCUED marker into the results dir so the rescue is visible and its evidence is uploaded. The deliberate trade-off: failures that only reproduce under 40-way parallel load are excused — the suite asserts spec compliance, not behavior under contention.
  2. Spawn-storm reduction. uv sync --compile-bytecode (plus pre-compiling the editable sources) stops ~40 fresh interpreters racing to byte-compile the same modules during the measurement window, and the client command execs .venv/bin/python directly instead of paying uv run's lockfile re-check in every concurrent spawn. This shrinks the contention that causes the flake in the first place.
  3. related_request_id in the everything-server's test_sampling. Without it, the sampling request routes to the standalone GET stream and is silently dropped if the harness client hasn't finished opening that stream (~2ms race), hanging tools-call-sampling for the full 60s client timeout — the server-side flake observed on main.

Both jobs now save --output-dir results and upload them when the job fails or a flake was rescued, so every future incident carries its measured numbers (today the suite summary prints only counts for warning-level findings).

Related: the upstream scenario fix is proposed in modelcontextprotocol/conformance#375; this PR makes python-sdk CI robust regardless of when/how that lands, and the wrapper simply never fires once the upstream fix is pinned.

How Has This Been Tested?

The wrapper was driven end-to-end against a locally-packed harness tarball identical in behavior to the CI pin, with real and simulated clients:

  • healthy suite → pass-through, exit 0 (421 checks, baseline clean)
  • one-shot-flaky client → suite red, solo rerun passes → exit 0, FLAKE_RESCUED marker written, solo results saved
  • persistently-broken client → both solo attempts fail → exit 1, solo evidence saved, no marker
  • stale baseline entry co-occurring with a flake → exit 1 without any reruns ("not retrying")
  • infra-shaped failure (no unexpected-failures section) → original exit code passed through
  • CONFORMANCE_PKG unset → clean guard message; equals-form flags filtered; shellcheck clean

The everything-server fix was verified both directions over streamable HTTP with no GET stream open: on the base code the sampling request is dropped and the call times out; with the fix it arrives on the tools/call POST stream and completes (also passes the harness's tools-call-sampling scenario 1/1).

Breaking Changes

None. CI-only plus one example-server fix.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed (n/a — CI internals)

Additional context

The solo reruns intentionally drop --expected-failures: a scenario in the unexpected list is by definition not baselined, so the solo verdict needs no baseline context (and single-scenario semantics are a strict superset — warnings and client errors still fail). Solo results land in <output-dir>-solo next to the suite's, so uploaded artifacts are self-contained for both the rescued and the still-fails cases.

AI Disclaimer

… result artifacts

The client conformance suite launches every scenario's client subprocess
concurrently; on a 2-vCPU runner the resulting contention intermittently
pushes sse-retry's reconnect-timing measurement past its tolerance and
fails the job (~4-5% of runs, including pushes to main).

- .github/actions/conformance/run-client.sh: wraps the client suite legs.
  On failure, scenarios listed as unexpected failures are re-run alone on
  the then-quiet runner: a real failure fails again; a contention artifact
  passes and the job goes green with a FLAKE_RESCUED marker saved into the
  results directory. Stale-baseline errors and infra failures are never
  retried, and an unparseable failure list falls back to the original exit
  code, so the wrapper cannot green-wash anything but solo-verified passes.
- conformance.yml: uv sync gains --compile-bytecode and the editable
  sources are pre-compiled, so ~40 concurrent interpreters stop racing to
  byte-compile the same modules during the measurement window; the client
  command execs the synced venv's interpreter directly instead of paying
  uv's lockfile re-check in every spawn; both jobs save --output-dir
  results and upload them when the job fails or a flake was rescued.
- everything-server: test_sampling passes related_request_id so the
  sampling request rides the originating tools/call SSE stream instead of
  racing the client's standalone GET stream (a dropped request there hangs
  tools-call-sampling to the 60s client timeout).
@maxisbey maxisbey marked this pull request as ready for review July 1, 2026 14:27
Comment thread .github/actions/conformance/run-client.sh
…tempt

A solo failure on the quiet runner already disproves the contention
hypothesis, so a second attempt was the blind retry the wrapper exists to
avoid; CONFORMANCE_SOLO_ATTEMPTS remains the override.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 log.txt:1-12 — This PR adds a stray log.txt at the repository root containing a fake conformance-suite summary (ANSI codes, "Unexpected failures (not in baseline):", "✗ sse-retry") — it looks like a local fixture used while developing run-client.sh's log parsing that was committed by accident. Nothing references it and the PR description doesn't mention it; please drop it from the PR (or move it under a fixtures location if it's meant to be used by a test).

    Extended reasoning...

    What the file is. The diff adds a new 12-line log.txt at the repository root. Its contents are a synthetic conformance-suite summary: a "Total: 1 passed, 2 failed, 0 warnings" line, raw ANSI escape sequences (\x1b[33m, \x1b[31m), an "Expected failures (in baseline):" block, a "Stale baseline entries (now passing - remove from baseline):" block, and an "Unexpected failures (not in baseline):" block listing ✗ sse-retry. This is exactly the summary format that the new .github/actions/conformance/run-client.sh parses with its sed/grep pipeline (the sed -n '/^Unexpected failures (not in baseline):$/,/^$/p' extraction, the ^Stale baseline entries guard, and the ANSI-stripping sed 's/\x1b\[[0-9;]*m//g'), so it is almost certainly a hand-crafted local fixture used while developing and testing the wrapper's parsing logic.\n\nNothing uses it. Grepping the repository for log.txt finds only unrelated matches (pip-log.txt in .gitignore and errlog.txt fixtures under tests/transports/stdio/). Neither run-client.sh, conformance.yml, nor any test references a repo-root log.txt — the wrapper writes its own log to a mktemp file at runtime, so this file plays no role in CI or in any test.\n\nIt was not intended to be part of the PR. The PR description enumerates the changes as "CI-only plus one example-server fix" and lists exactly three changes (the run-client.sh wrapper, the workflow spawn-storm/artifact changes, and the everything-server related_request_id fix); log.txt is not mentioned anywhere. It is also not covered by .gitignore (the existing pip-log.txt entry doesn't match, and there is no *.txt/log.txt pattern), which is how it slipped into the commit.\n\nImpact. Nothing breaks functionally — the file is inert. The cost is source-tree pollution: a confusingly named artifact at the repo root that future readers will wonder about, and that ships in the sdist/source checkout for no reason.\n\nConcrete walk-through. (1) The author develops run-client.sh locally and needs a sample suite log to exercise the ANSI-strip + section-extraction parsing, so they save one as log.txt in the repo root. (2) They run something like sed 's/\x1b\[[0-9;]*m//g' log.txt | sed -n '/^Unexpected failures/...' to confirm the extraction yields sse-retry and the stale-baseline guard fires. (3) A later git add of the working tree picks up log.txt because no ignore rule matches it. (4) The file lands in the PR unreferenced. (5) After merge, every checkout carries a stray log.txt at the root that no code reads.\n\nFix. Remove log.txt from the PR (git rm log.txt). If the author wants to keep a parsing fixture, move it under .github/actions/conformance/ (or a tests fixtures directory) with a descriptive name and actually reference it from a test of the wrapper's parsing — but as-is it should simply be deleted.

grep -q quitting at the first match SIGPIPEs the printf feeding it; with
pipefail the pipeline reports 141 and the guard is skipped exactly when
the pattern is present. A here-string has no writer to kill.
@maxisbey

maxisbey commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Re the stray log.txt finding: removed in b4558ce (it was a review-time parsing fixture swept in by a bulk git add), and conformance-results/ is now gitignored so local repro runs cannot repeat the pattern.

@maxisbey maxisbey enabled auto-merge (squash) July 1, 2026 15:58
@maxisbey maxisbey merged commit d39c68d into main Jul 1, 2026
35 checks passed
@maxisbey maxisbey deleted the conformance-ci-deflake branch July 1, 2026 15:59

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stale-baseline guard issue I flagged earlier is fixed in a2072c3 and I found no further bugs — but since the flake-rescue policy (excusing failures that only reproduce under parallel load) and the ~100-line bash wrapper are a deliberate CI design choice, this deserves a maintainer's sign-off rather than an automated approval.

Extended reasoning...

Overview

This PR is CI-only plus one example-server fix: a new run-client.sh wrapper (~104 lines of bash) that re-runs unexpected conformance failures solo before failing the job, workflow changes to reduce interpreter spawn contention (--compile-bytecode, direct .venv/bin/python invocation) and upload result artifacts, a .gitignore entry, and adding related_request_id to the everything-server's test_sampling tool so the sampling request routes to the tools/call POST stream instead of a possibly-not-yet-open GET stream.

Security risks

Low. No changes to src/ or any shipped SDK code. The workflow keeps the pinned, sha256-verified harness tarball flow unchanged, and the artifact upload only publishes conformance result files. The bash wrapper validates extracted scenario names against a strict character allowlist before reusing them in commands, which mitigates injection from harness output.

Level of scrutiny

Moderate. Nothing here affects users of the SDK, but the wrapper encodes a policy decision — failures that only reproduce under 40-way parallel load are excused after passing solo — which trades a small amount of failure-detection fidelity for CI stability. That trade-off, and whether the guard rails (stale-baseline check, name validation, pass-through on unparseable output, FLAKE_RESCUED marker + artifact upload) are sufficient to prevent green-washing real regressions, is a judgment call maintainers should own. The bash logic is also intricate enough (arg re-parsing, PIPESTATUS handling, pipefail interactions — one of which I flagged and the author fixed) that a human read is warranted.

Other factors

My earlier inline comment about the SIGPIPE/pipefail fail-open in the stale-baseline guard was addressed in a2072c3 with a here-string and re-verified by the author; the current diff reflects the fix. The bug hunting system found no issues on the current revision. The PR description documents extensive local end-to-end testing of the wrapper's pass-through, rescue, still-broken, stale-baseline, and infra-failure paths, and the everything-server fix was verified in both directions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants