gh-136063: Fix quadratic complexity in the email header value parser #152521
Open
serhiy-storchaka wants to merge 2 commits into
…arser

Rewrite the parser in Lib/email/_header_value_parser.py to advance through the input using indices instead of repeatedly slicing off the already-parsed prefix. Each get_*/parse_* function now takes a (value, pos) pair and returns the parsed token together with the new position, removing the O(n) remainder copy that made parsing O(n^2).

As part of this change, an obsolete local-part is re-parsed from the original source text rather than from the decoded representation of the already-parsed tokens. This only affects malformed addresses that contain an RFC 2047 encoded-word inside an addr-spec, which RFC 2047 does not permit; such an encoded-word is no longer decoded and the address is reported as invalid (pythongh-152519).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
email Module #136063

The `email` header value parser advanced through the input by repeatedly slicing off the already-parsed prefix (`value = value[1:]`, `value = ''.join(remainder)`), copying the whole remainder on every step. For headers built from many small tokens (long address lists, encoded-word runs, content-type parameters) this is O(n²) and can be exploited for denial of service.

This rewrites the parser to pass an index: every `get_*`/`parse_*` now takes `(value, pos)` and returns the parsed token together with the new position, scanning with `value[pos]`, `regex.match(value, pos)`, and `match.end()`. No remainder is ever copied.

A few things worth calling out for review:

- Error messages now interpolate at most a 60-character fragment of the unparsed input (`_tail()`). Several of these `HeaderParseError`s are raised and caught as ordinary control flow, so interpolating the full remainder was itself O(n) and reintroduced the quadratic behaviour. No test asserts on the message text.
- `EncodedWord.cte` now holds just the encoded word rather than the entire remaining header (the old code assigned the whole remainder to `.cte`).
- An obsolete `local-part` is now re-parsed from the original source text instead of from `str(local_part) + remaining` (the decoded rendering of the already-parsed tokens). This is both a quadratic spot and incorrect when a token rendered back to text differs from the source, e.g. an RFC 2047 encoded-word decoding to a bare special such as `@`, which RFC 2047 does not permit in an `addr-spec` anyway. Such an encoded-word is no longer decoded and the address is reported as invalid. See email: RFC 2047 encoded-word in an addr-spec local-part corrupts address parsing #152519 for the underlying correctness bug.

All `test_email`, `test_smtplib` and `test_logging` tests pass. Non-debug benchmarks show the previously quadratic headers now scale linearly (e.g. `parse_content_type` ~8× faster at 64k, `get_address_list` ~5.6× faster at 72k).
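
For reviewers who want the shape of the change in miniature, here is a rough sketch of the two calling conventions side by side. It is not code from this PR: the `_ATOM` pattern, the `get_atom_*` and `parse_dotted_*` functions, and the `tail()` helper are hypothetical stand-ins, and `ValueError` stands in for `HeaderParseError`. Only the `(value, pos)` convention, `pattern.match(value, pos)`, `match.end()`, and the 60-character cap come from the description above.

```python
import re

# Hypothetical token pattern, for illustration only; not the pattern
# used in Lib/email/_header_value_parser.py.
_ATOM = re.compile(r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+")


def tail(value, pos, limit=60):
    """Return at most `limit` characters of unparsed input for messages."""
    return value[pos:pos + limit]


def get_atom_sliced(value):
    """Old convention: return (token, remainder)."""
    m = _ATOM.match(value)
    if m is None:
        raise ValueError(f"expected atom but found {value!r}")
    # Slicing copies the whole remainder: O(len(value)) on every call.
    return m.group(), value[m.end():]


def get_atom_indexed(value, pos):
    """New convention: return (token, new_pos); nothing is copied."""
    m = _ATOM.match(value, pos)  # start matching at index pos
    if m is None:
        # Only a bounded fragment is interpolated into the message.
        raise ValueError(f"expected atom but found {tail(value, pos)!r}")
    return m.group(), m.end()  # m.end() is an index into `value`


def parse_dotted_sliced(value):
    """Parse 'a.b.c' by repeatedly cutting off the parsed prefix."""
    tokens = []
    while value:
        token, value = get_atom_sliced(value)
        tokens.append(token)
        if value[:1] == ".":
            value = value[1:]  # another full copy of the remainder
    return tokens


def parse_dotted_indexed(value):
    """Parse 'a.b.c' by advancing an index through the same string."""
    tokens = []
    pos, end = 0, len(value)
    while pos < end:
        token, pos = get_atom_indexed(value, pos)
        tokens.append(token)
        if pos < end and value[pos] == ".":
            pos += 1  # skip the separator without copying
    return tokens
```

Both parsers return the same tokens for the same input; the difference is that the sliced version does work proportional to the remaining input at every token, while the indexed version does work proportional to the token itself.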