
gh-136063: Fix quadratic complexity in the email header value parser #152521

Open
serhiy-storchaka wants to merge 2 commits into python:main from serhiy-storchaka:email-parser-indexing


@serhiy-storchaka

Member

The email header value parser advanced through the input by repeatedly slicing off the already-parsed prefix (value = value[1:], value = ''.join(remainder)), copying the whole remainder on every step. For headers built from many small tokens (long address lists, encoded-word runs, content-type parameters) this is O(n²) and can be exploited for denial of service.
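To make the cost concrete, here is a minimal sketch (not code from the PR) that counts how many characters get copied when a string is consumed one character at a time with `value = value[1:]`, compared with advancing an index. The function names are hypothetical and exist only for illustration.

```python
def chars_copied_by_slicing(value):
    """Consume `value` one character per step with value = value[1:].

    Each slice builds a new string holding the whole remainder, so the
    total number of copied characters is (n-1) + (n-2) + ... + 0.
    """
    copied = 0
    while value:
        value = value[1:]      # copies every remaining character
        copied += len(value)
    return copied


def chars_copied_by_indexing(value):
    """Consume `value` one character per step by advancing an index."""
    copied = 0                 # nothing is ever copied
    pos = 0
    while pos < len(value):
        pos += 1
    return copied


n = 1000
print(chars_copied_by_slicing("a" * n))    # n * (n - 1) // 2
print(chars_copied_by_indexing("a" * n))
```

Doubling the input length roughly quadruples the slicing count, which is the O(n²) behaviour described above.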

This rewrites the parser to pass an index: every get_*/parse_* now takes (value, pos) and returns the parsed token together with the new position, scanning with value[pos], regex.match(value, pos), and match.end(). No remainder is ever copied.
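The calling convention can be sketched roughly as follows. This is a toy comma-separated token parser, not the actual code in `Lib/email/_header_value_parser.py`; the token definition and the names `get_token` and `get_token_list` are assumptions made for illustration. It relies only on `value[pos]`, `Pattern.match(value, pos)` and `match.end()`.

```python
import re

# Toy grammar (an assumption for this sketch): a token is any run of
# characters that are neither whitespace nor a comma.
_TOKEN = re.compile(r"[^\s,]+")
_WSP = re.compile(r"[ \t]*")


def get_token(value, pos):
    """Parse one token starting at `pos`; return (token, new_pos)."""
    pos = _WSP.match(value, pos).end()      # skip leading whitespace
    m = _TOKEN.match(value, pos)            # match in place, no slicing
    if m is None:
        raise ValueError(f"expected a token at position {pos}")
    return m.group(), m.end()


def get_token_list(value, pos=0):
    """Parse token *("," token); return (list_of_tokens, new_pos)."""
    tokens = []
    token, pos = get_token(value, pos)
    tokens.append(token)
    while True:
        pos = _WSP.match(value, pos).end()
        if pos >= len(value) or value[pos] != ",":
            break
        token, pos = get_token(value, pos + 1)   # step over the comma
        tokens.append(token)
    return tokens, pos


print(get_token_list("foo, bar ,baz"))
```

The only strings created are the matched tokens themselves, so the total work stays proportional to the input length.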

A few things worth calling out for review:

  • Error messages now interpolate at most a 60-character fragment of the unparsed input (_tail()). Several of these HeaderParseErrors are raised and caught as ordinary control flow, so interpolating the full remainder was itself O(n) and reintroduced the quadratic behaviour. No test asserts on the message text.

  • EncodedWord.cte now holds just the encoded word rather than the entire remaining header (the old code assigned the whole remainder to .cte).

  • An obsolete local-part is now re-parsed from the original source text instead of from str(local_part) + remaining (the decoded rendering of the already-parsed tokens). The old approach was both a quadratic spot and incorrect whenever a token rendered back to text differs from the source, e.g. an RFC 2047 encoded-word decoding to a bare special such as @, which RFC 2047 does not permit in an addr-spec anyway. Such an encoded-word is no longer decoded and the address is reported as invalid. See #152519 (email: RFC 2047 encoded-word in an addr-spec local-part corrupts address parsing) for the underlying correctness bug.
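The bounded error fragment from the first bullet might look something like the sketch below. The real `_tail()` signature is not shown in this description, so `tail_fragment` and `expect_char` are hypothetical names, and `ValueError` stands in for `HeaderParseError`; only the 60-character limit comes from the text above.

```python
def tail_fragment(value, pos, limit=60):
    """Return at most `limit` characters of `value` starting at `pos`."""
    return value[pos:pos + limit]


def expect_char(value, pos, char):
    """Return the position after `char`, or raise with a bounded message."""
    if pos >= len(value) or value[pos] != char:
        # Building the message costs O(limit), not O(len(value) - pos),
        # so raising it in a loop does not reintroduce quadratic work.
        raise ValueError(
            f"expected {char!r} but found {tail_fragment(value, pos)!r}"
        )
    return pos + 1
```

Because such errors are raised and caught as ordinary control flow, the message has to be cheap to build even when nobody ever reads it.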

All test_email, test_smtplib and test_logging tests pass. Benchmarks on a non-debug build show that the previously quadratic headers now scale linearly (e.g. parse_content_type ~8× faster at 64k, get_address_list ~5.6× faster at 72k).
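One way to check the scaling locally is a small timing loop over the public `email.headerregistry.HeaderRegistry` API, which parses the value when the header object is created. This is not the benchmark used for the numbers above; the helper name, the generated addresses and the input sizes are assumptions chosen for the sketch.

```python
import time
from email.headerregistry import HeaderRegistry

_registry = HeaderRegistry()


def time_address_list(n_addresses):
    """Parse a To header with `n_addresses` entries.

    Returns (number_of_parsed_addresses, elapsed_seconds).
    """
    value = ", ".join(f"user{i}@example.com" for i in range(n_addresses))
    start = time.perf_counter()
    header = _registry("To", value)     # parsing happens on construction
    elapsed = time.perf_counter() - start
    return len(header.addresses), elapsed


if __name__ == "__main__":
    # With linear scaling, doubling n should roughly double the time.
    for n in (250, 500, 1000, 2000):
        count, elapsed = time_address_list(n)
        print(f"{n:>5} addresses ({count} parsed): {elapsed:.4f}s")
```

Single runs are noisy, so repeating each size a few times and taking the minimum gives a steadier picture.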

…arser

Rewrite the parser in Lib/email/_header_value_parser.py to advance
through the input using indices instead of repeatedly slicing off the
already-parsed prefix.  Each get_*/parse_* function now takes a (value,
pos) pair and returns the parsed token together with the new position,
removing the O(n) remainder copy that made parsing O(n^2).

As part of this change, an obsolete local-part is re-parsed from the
original source text rather than from the decoded representation of the
already-parsed tokens.  This only affects malformed addresses that
contain an RFC 2047 encoded-word inside an addr-spec, which RFC 2047
does not permit; such an encoded-word is no longer decoded and the
address is reported as invalid (pythongh-152519).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>