Skip to content

fix(sse-retry): gate retry timing only on the early side#375

Open
maxisbey wants to merge 1 commit into
modelcontextprotocol:mainfrom
maxisbey:sse-retry-load-monotone
Open

fix(sse-retry): gate retry timing only on the early side#375
maxisbey wants to merge 1 commit into
modelcontextprotocol:mainfrom
maxisbey:sse-retry-load-monotone

Conversation

@maxisbey

@maxisbey maxisbey commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Problem

sse-retry flakes in python-sdk CI (~4–5% of runs, including recent failures on every push to main), always with the signature ✗ sse-retry: 2 passed, 0 failed, 1 warnings — the retry-timing check landing in its (700, 1000]ms "slightly late" WARNING band, which fails the scenario under the WARNING policy (#245). Example run: https://github.com/modelcontextprotocol/python-sdk/actions/runs/28443055485

The 2025-11-25 spec sentence is "The client MUST respect the retry field, waiting the given number of milliseconds before attempting to reconnect" — a lower bound. Neither MCP nor WHATWG SSE bounds reconnect delay from above. The measured delay is the client's sleep (≥ retry) plus EOF detection, GET propagation, and harness dispatch — all inflated by load, and suite mode runs every scenario's client concurrently. Measured locally with the python client: 503ms idle against the 700ms gate, 591–674ms with the suite pinned to 2 cores. The band gates the environment, not the client.

Change (one file, +11/−36)

  • Delete the slightly-late WARNING and >2× FAILURE bands: the timing check is now FAILURE only when the reconnect arrives earlier than retry − 50ms, otherwise SUCCESS, with the measured delay kept in details. Load can only push a compliant client deeper into the passing region, so the flake is structurally gone.
  • Demote the "could not measure timing" path from WARNING to INFO (the other spurious-warning path).
  • One-line fix: the initialize response negotiated protocolVersion: 2025-03-26, a version whose spec had no retry MUST; now 2025-11-25 (same issue 4686e06 fixed for server-sse-polling).

Strictly relaxing — no SDK that passes today can start failing on a pin bump; retry stays at 500ms so test duration is unchanged.

Trade-off accepted: the >2× band was the only (unsound) detector for clients that ignore the retry field — a client's ~1000ms default backoff sat near that threshold by coincidence, and a loaded runner pushed compliant clients over it. With the band gone, retry-field-ignorance is untested, which seems proportionate for a scenario that is removed in draft.

Verification

  • python-sdk client, real run: 3/3 SUCCESS (actualDelayMs: 502).
  • python-sdk client patched to ignore the retry hint (reconnects at its 1000ms fallback, previously red): now passes, delay recorded in details.
  • npm run test 328 passed; typecheck and lint clean.

AI Disclaimer

@pkg-pr-new

pkg-pr-new Bot commented Jun 30, 2026

Copy link
Copy Markdown

Open in StackBlitz

npx https://pkg.pr.new/@modelcontextprotocol/conformance@375

commit: 4265e85

@maxisbey maxisbey marked this pull request as ready for review June 30, 2026 14:50
@pcarleton

Copy link
Copy Markdown
Member

can we fix this more surgically?

  • this test is irrelevant in 2026-07-28 and beyond, so i don't want to invest a bunch more into it
  • SDK's have passed at this for previous tiering, so every change to an existing suite means potentially noisy failures when an SDK bumps (maybe it gets better! but we'd need to test that)
  • Bumping timeouts to be super long means every test runs for at least that long. if we are going to do this, i'd rather we categorize the test as "long running" / tiering only, so we don't run it in CI by default, and then we just run it during tiering. that's not ideal either though
  • 5+ line comments are ai smell imo

The retry MUST ("waiting the given number of milliseconds before
attempting to reconnect") is a lower bound; no spec text bounds reconnect
delay from above. The slightly-late WARNING and >2x FAILURE bands gated
environment latency, which CI load inflates - in python-sdk CI the warning
band fired on ~4-5% of runs under the suite's parallel client spawns.

- delete the late bands: FAILURE only when the reconnect arrives earlier
  than retry minus the 50ms tolerance
- demote the could-not-measure path from WARNING to INFO
- negotiate protocolVersion 2025-11-25, the version whose MUST is under
  test (was 2025-03-26; same fix as 4686e06 for server-sse-polling)

Strictly relaxing: no client that passes today can start failing.
@maxisbey maxisbey force-pushed the sse-retry-load-monotone branch from 5905aa7 to 4265e85 Compare June 30, 2026 17:41
@maxisbey maxisbey changed the title fix(sse-retry): make the retry-timing verdict one-sided and causally keyed fix(sse-retry): gate retry timing only on the early side Jun 30, 2026
@maxisbey

Copy link
Copy Markdown
Contributor Author

Descoped — the diff is now one file, +11/−36:

  • deleted the two late-side bands (the 2025-11-25 MUST is "waiting the given number of milliseconds before attempting to reconnect" — a lower bound; neither MCP nor WHATWG SSE bounds reconnect delay from above), and demoted the can't-measure warning to INFO
  • retry stays at 500ms, so test duration is unchanged — no need for a long-running/tiering-only class
  • strictly relaxing: no SDK that passes today can start failing on a bump; flaky SDKs just stop flaking
  • kept one one-line fix: the initialize response negotiated 2025-03-26, a version with no retry MUST (same issue 4686e06 fixed for server-sse-polling)

Trade-off accepted: the >2× band was the only (unsound) detector for clients that ignore the retry field; with it gone that's untested, which seems right for a scenario that's removed in draft. PR description updated to match.

AI Disclaimer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants