feat(types): support CHAR/VARCHAR/BINARY/VARBINARY in data type json parser #392
Open
SteNicholas wants to merge 1 commit into
7f7d92a to
ad1be82
Compare
lxy-9602
reviewed
Jun 30, 2026
3dbddd7 to
53f54e5
Compare
7e44844 to
879f85c
Compare
…parser

Previously CHAR/VARCHAR/BINARY/VARBINARY were registered as keywords but had no handling branch in ParseTypeByKeyword, so deserializing a schema that contained them failed with "Unsupported type: VARCHAR".

Map CHAR/VARCHAR to arrow::utf8() and BINARY/VARBINARY to arrow::binary() (consistent with STRING and BYTES), parsing the optional length parameter and validating it is within [1, INT_MAX] (both inclusive), consistent with Java Paimon.

Add parser test cases covering these types with and without a length argument (including the min/max length boundaries and invalid lengths), an end-to-end write/read integration test exercising the types through the full Paimon flow, and update the data type mapping doc accordingly.

Use a custom raw-string delimiter (R"json(...)json") for the schema JSON in the integration test so that the embedded type strings such as "CHAR(10)" do not terminate the raw string early at their internal )" sequence.
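The raw-string point in the commit message can be illustrated with a minimal sketch. The function name SchemaJson and the JSON field names below are hypothetical and are not taken from the actual integration test; only the R"json(...)json" delimiter comes from the commit message.

```cpp
#include <cassert>
#include <string>

// Hypothetical schema fragment used only for illustration.
// With the default delimiter, R"( ... )", the literal would end at the first
// )" sequence, which appears right after CHAR(10. The custom "json" delimiter
// means only the exact sequence )json" closes the literal.
inline std::string SchemaJson() {
    return R"json({
  "fields": [
    {"id": 0, "name": "c", "type": "CHAR(10)"},
    {"id": 1, "name": "b", "type": "VARBINARY(20)"}
  ]
})json";
}
```

Any delimiter of up to 16 characters works, as long as the sequence `)delimiter"` never occurs inside the JSON itself.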
Purpose
Linked issue: #197
CHAR, VARCHAR, BINARY, and VARBINARY were already registered as keywords (in the Keyword enum and Keywords() map), but ParseTypeByKeyword had no case for them, so they fell through to the default branch. Deserializing any schema that contained them failed with an error such as:

    Unsupported type: VARCHAR

This change maps them to the corresponding Arrow types, consistent with how STRING and BYTES are handled:

- CHAR / VARCHAR -> arrow::utf8()
- BINARY / VARBINARY -> arrow::binary()

It reuses the existing ParseStringType<T>() helper. arrow::utf8() and arrow::binary() are variable-length types that carry no length attribute, so the optional length parameter (e.g. the (10) in VARCHAR(10)) is parsed but not retained on the Arrow type. The declared length is validated to be within [1, INT_MAX] (both inclusive), consistent with Java Paimon; out-of-range lengths such as VARCHAR(0) or VARCHAR(2147483648) are rejected.
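
The keyword mapping and length check described above can be sketched in standalone C++. This is not the PR's code: SketchType and ParseSketch are hypothetical names, the enum stands in for the Arrow types so the example compiles without Arrow, and the sketch assumes the type string contains no whitespace.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical stand-in for arrow::utf8() / arrow::binary().
enum class SketchType { kUtf8, kBinary };

// Parses "VARCHAR" or "VARCHAR(10)" style strings. Returns std::nullopt for
// unknown keywords, malformed lengths, or lengths outside [1, INT_MAX].
inline std::optional<SketchType> ParseSketch(const std::string& s) {
    std::string keyword = s;
    const std::size_t open = s.find('(');
    if (open != std::string::npos) {
        // Require at least one character between '(' and a trailing ')'.
        if (s.size() < open + 3 || s.back() != ')') return std::nullopt;
        const std::string digits = s.substr(open + 1, s.size() - open - 2);
        for (char c : digits) {
            if (!std::isdigit(static_cast<unsigned char>(c))) return std::nullopt;
        }
        errno = 0;
        const long long len = std::strtoll(digits.c_str(), nullptr, 10);
        // The length is validated and then dropped: the target type has no length.
        if (errno == ERANGE || len < 1 || len > INT_MAX) return std::nullopt;
        keyword = s.substr(0, open);
    }
    if (keyword == "CHAR" || keyword == "VARCHAR") return SketchType::kUtf8;
    if (keyword == "BINARY" || keyword == "VARBINARY") return SketchType::kBinary;
    return std::nullopt;
}
```

Under these assumptions, ParseSketch("VARCHAR(2147483647)") yields the utf8 stand-in, while ParseSketch("VARCHAR(0)"), ParseSketch("VARCHAR(2147483648)"), and ParseSketch("VARCHAR(test)") all yield std::nullopt. The length is read into a long long so that values just above INT_MAX are detected by comparison and not by overflow.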
Tests
- DataTypeJsonParserTest.ParseTypeAtomicTypeSuccess is extended to cover each new type with and without a length argument (CHAR, CHAR(10), VARCHAR, VARCHAR(10), BINARY, BINARY(10), VARBINARY, VARBINARY(10)), the inclusive length boundaries (CHAR(1), VARCHAR(2147483647)), and invalid lengths (VARCHAR(0), VARBINARY(0), VARCHAR(2147483648), plus the existing VARCHAR(test)).
- WriteAndReadInteTest.TestCharVarcharBinaryVarbinaryTypes adds an end-to-end write/read case (parquet/orc): it creates a table, then evolves to a hand-written schema whose columns declare CHAR(10) / VARCHAR(20) / BINARY(10) / VARBINARY(20) (the Arrow-based serializer always renders string/binary as STRING / BYTES, so the schema is written by hand), and writes then reads back rows, including a NULL row, through the full Paimon write/commit/scan/read flow.

API and Format
No public API (include/) or storage format/protocol change. This only broadens schema deserialization to accept type strings that previously errored.
Documentation
Yes.
docs/source/user_guide/data_types.rst is updated to mark CHAR / VARCHAR -> Utf8 and BINARY / VARBINARY -> Binary as supported, with a note that the declared length is not enforced.
Generative AI tooling
Generated-by: Claude Code (Claude Opus 4.8)