feat(types): support CHAR/VARCHAR/BINARY/VARBINARY in data type json parser #392

Open
SteNicholas wants to merge 1 commit into alibaba:main from SteNicholas:PAIMON-197

SteNicholas (Contributor) commented Jun 30, 2026

Purpose

Linked issue: #197

CHAR, VARCHAR, BINARY, and VARBINARY were already registered as keywords
(in the Keyword enum and Keywords() map), but ParseTypeByKeyword had no
case for them, so they fell through to the default branch. Deserializing any
schema that contained them failed with:

deserialize failed, possibly type incompatible: parse data type failed, error msg: Invalid: Unsupported type: VARCHAR

This change maps them to the corresponding Arrow types, consistent with how
STRING and BYTES are handled:

  • CHAR / VARCHAR -> arrow::utf8()
  • BINARY / VARBINARY -> arrow::binary()

It reuses the existing ParseStringType<T>() helper. Arrow has no fixed-length
char/binary type, so the optional length parameter (e.g. the (10) in
VARCHAR(10)) is parsed but not retained on the Arrow type. The declared length
is validated to be within [1, INT_MAX] (both inclusive), consistent with Java
Paimon; out-of-range lengths such as VARCHAR(0) or VARCHAR(2147483648) are
rejected.
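
The mapping and length check described above can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: `ArrowKind` and `ParseCharLikeType` are hypothetical names standing in for the Arrow types and for the existing `ParseTypeByKeyword` / `ParseStringType<T>()` path, and the sketch assumes the type string carries no whitespace or trailing modifiers.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical stand-in for the Arrow type returned by the real parser
// (arrow::utf8() / arrow::binary() in the PR).
enum class ArrowKind { kUtf8, kBinary };

// Parses "KEYWORD" or "KEYWORD(len)". Returns std::nullopt when the keyword
// is not one of the four types or the length is outside [1, INT_MAX].
std::optional<ArrowKind> ParseCharLikeType(const std::string& type_str) {
    const std::size_t open = type_str.find('(');
    const std::string keyword = type_str.substr(0, open);

    ArrowKind kind;
    if (keyword == "CHAR" || keyword == "VARCHAR") {
        kind = ArrowKind::kUtf8;
    } else if (keyword == "BINARY" || keyword == "VARBINARY") {
        kind = ArrowKind::kBinary;
    } else {
        return std::nullopt;
    }
    if (open == std::string::npos) {
        return kind;  // No length argument was declared.
    }

    // Expect "(digits)" with at least one digit and nothing after ')'.
    if (type_str.size() < open + 3 || type_str.back() != ')') {
        return std::nullopt;
    }
    const std::string digits =
        type_str.substr(open + 1, type_str.size() - open - 2);
    for (char c : digits) {
        if (c < '0' || c > '9') {
            return std::nullopt;  // Rejects inputs such as VARCHAR(test).
        }
    }

    errno = 0;
    const long long length = std::strtoll(digits.c_str(), nullptr, 10);
    if (errno == ERANGE || length < 1 || length > INT_MAX) {
        return std::nullopt;  // Rejects VARCHAR(0) and VARCHAR(2147483648).
    }
    // The length is validated but not retained: the result carries only the kind.
    return kind;
}
```

In this sketch `ParseCharLikeType("VARCHAR(10)")` and `ParseCharLikeType("VARCHAR")` yield the same value, which mirrors the point that the declared length does not survive on the Arrow type.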

Tests

DataTypeJsonParserTest.ParseTypeAtomicTypeSuccess is extended to cover each new
type with and without a length argument (CHAR, CHAR(10), VARCHAR,
VARCHAR(10), BINARY, BINARY(10), VARBINARY, VARBINARY(10)), the
inclusive length boundaries (CHAR(1), VARCHAR(2147483647)), and invalid
lengths (VARCHAR(0), VARBINARY(0), VARCHAR(2147483648), plus the existing
VARCHAR(test)).

WriteAndReadInteTest.TestCharVarcharBinaryVarbinaryTypes adds an end-to-end
write/read case (parquet/orc). It creates a table, then evolves it to a schema
whose columns declare CHAR(10)/VARCHAR(20)/BINARY(10)/VARBINARY(20). The schema
is written by hand because the Arrow-based serializer always renders
string/binary as STRING/BYTES. The test then writes rows, including a NULL row,
and reads them back through the full Paimon write/commit/scan/read flow.

API and Format

No public API (include/) or storage format/protocol change. This only broadens
schema deserialization to accept type strings that previously errored.

Documentation

Yes. docs/source/user_guide/data_types.rst is updated to mark
CHAR/VARCHAR -> Utf8 and BINARY/VARBINARY -> Binary as supported, with
a note that the declared length is not enforced.

Generative AI tooling

Generated-by: Claude Code (Claude Opus 4.8)

Copilot AI review requested due to automatic review settings June 30, 2026 09:20

Copilot AI left a comment


Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

SteNicholas force-pushed the PAIMON-197 branch 3 times, most recently from 7f7d92a to ad1be82 (June 30, 2026 09:29)
Comment thread src/paimon/common/types/data_type_json_parser_test.cpp
SteNicholas force-pushed the PAIMON-197 branch 2 times, most recently from 3dbddd7 to 53f54e5 (June 30, 2026 10:05)
SteNicholas requested a review from lxy-9602 (June 30, 2026 10:07)
lxy-9602 previously approved these changes Jun 30, 2026

lxy-9602 (Collaborator) left a comment

+1

…parser

Previously CHAR/VARCHAR/BINARY/VARBINARY were registered as keywords but had
no handling branch in ParseTypeByKeyword, so deserializing a schema that
contained them failed with "Unsupported type: VARCHAR". Map CHAR/VARCHAR to
arrow::utf8() and BINARY/VARBINARY to arrow::binary() (consistent with STRING
and BYTES), parsing the optional length parameter and validating it is within
[1, INT_MAX] (both inclusive), consistent with Java Paimon.

Add parser test cases covering these types with and without a length argument
(including the min/max length boundaries and invalid lengths), an end-to-end
write/read integration test exercising the types through the full Paimon flow,
and update the data type mapping doc accordingly.

Use a custom raw-string delimiter (R"json(...)json") for the schema JSON in the
integration test so that the embedded type strings such as "CHAR(10)" do not
terminate the raw string early at their internal )" sequence.
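
A minimal sketch of the raw-string issue described above. The function name and the JSON fields are hypothetical and only illustrate the delimiter choice; they are not the schema used in the integration test.

```cpp
#include <string>

// Hypothetical schema fragment; the field layout is illustrative only.
std::string SchemaJson() {
    // With the default (empty) delimiter, the literal would end at the first
    // close-paren followed by a double quote, i.e. right after CHAR(10).
    // With the custom delimiter "json", only the sequence )json" ends it.
    return R"json({"fields": [{"id": 0, "name": "c", "type": "CHAR(10)"}]})json";
}
```

Any delimiter of up to 16 characters that does not occur in the payload would work equally well; `json` simply documents the content.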

3 participants