[table] Support reading dedicated and rolling (multi-segment) vector files by JunRuiLee · Pull Request #423 · apache/paimon-rust

JunRuiLee · 2026-06-30T07:25:16Z

Purpose

Add read-path support for dedicated .vector.<format> files in data-evolution splits, including rolling (multi-segment) vector columns. Upstream Paimon's DedicatedFormatRollingFileWriter rolls a vector column into multiple .vector. segments (using an independent VECTOR_TARGET_FILE_SIZE), so the Rust reader needs to consume such tables by reassembling the segments. This mirrors upstream's VectorFileBunch non-pushdown semantics in DataEvolutionSplitRead.java.

Changes

All changes are confined to the read path (crates/paimon/src/table/data_evolution_reader.rs).

Classify & route vector files: is_vector_store_file_name detector; normalize_merge_group classifies .vector. files into their own group, excludes them from the merge anchor, and forces the column-merge path.
VectorBunch (modeled on the existing BlobBunch): aggregates rolled segments belonging to one logical vector source, keyed by (schema_id, format_suffix, normalized_write_cols). VectorBunch::add enforces continuity and dedup: a lower-sequence segment with the same first_row_id is ignored, an overlapping lower-sequence segment is ignored, and a gap, a row-count overflow, or a key-identity mismatch is an error.
Segment ordering: normalize_merge_group sorts vector segments by first_row_id asc, then max_sequence_number desc, and no longer validates rolled segments (which are slices) against the normal-file row range.
normalized_write_cols: computed per vector file in load_file_infos (write columns sorted by field position; missing write_cols -> DataInvalid), so differently-ordered raw write columns that normalize equal aggregate into one bunch.
build_source_plan: aggregates same-key segments into a single FieldSource::VectorBunch, with a bunch-granularity duplicate-field-id guard and a bunch row-count check symmetric to the blob check. A single vector file is simply a bunch of length 1 -- one code path, no special-casing.
Read: VectorBunch is read via the same sequential-concat path as BlobBunch; each segment carries its real first_row_id/row_count, so the existing to_local_row_ranges mechanism clips absolute row ranges per segment correctly.
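
The continuity and dedup rules applied when adding a segment can be sketched roughly as follows. This is a minimal illustration, not the code in data_evolution_reader.rs: the type names (BunchKey, Segment, VectorBunchSketch, AddError), the field layout, the expected_row_count parameter, and the Ok(true)/Ok(false) return convention are all assumptions made for the example. It also assumes segments arrive already sorted by first_row_id ascending and max_sequence_number descending, and it rejects an overlapping segment that is not strictly older, a case the description above does not spell out.

```rust
/// Identity of one logical vector source (hypothetical layout).
#[derive(Debug, Clone, PartialEq, Eq)]
struct BunchKey {
    schema_id: i64,
    format_suffix: String,
    normalized_write_cols: Vec<String>,
}

/// Metadata of one rolled vector segment (hypothetical layout).
#[derive(Debug, Clone)]
struct Segment {
    key: BunchKey,
    first_row_id: u64,
    row_count: u64,
    max_sequence_number: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum AddError {
    KeyMismatch,
    Gap { expected: u64, found: u64 },
    RowCountOverflow { limit: u64, total: u64 },
    /// Assumption: an overlapping segment that is not strictly older is rejected.
    OverlapNotOlder,
}

struct VectorBunchSketch {
    key: BunchKey,
    expected_row_count: u64,
    row_count: u64,
    segments: Vec<Segment>,
}

impl VectorBunchSketch {
    fn new(key: BunchKey, expected_row_count: u64) -> Self {
        Self { key, expected_row_count, row_count: 0, segments: Vec::new() }
    }

    /// Returns Ok(true) if the segment was appended, Ok(false) if it was ignored.
    fn add(&mut self, seg: Segment) -> Result<bool, AddError> {
        if seg.key != self.key {
            return Err(AddError::KeyMismatch);
        }
        if let Some(last) = self.segments.last() {
            let expected = last.first_row_id + last.row_count;
            if seg.first_row_id < expected {
                // Same first_row_id or overlapping rows: keep the newer segment
                // that is already in the bunch and drop the older one.
                return if seg.max_sequence_number < last.max_sequence_number {
                    Ok(false)
                } else {
                    Err(AddError::OverlapNotOlder)
                };
            }
            if seg.first_row_id > expected {
                return Err(AddError::Gap { expected, found: seg.first_row_id });
            }
        }
        let total = self.row_count + seg.row_count;
        if total > self.expected_row_count {
            return Err(AddError::RowCountOverflow { limit: self.expected_row_count, total });
        }
        self.row_count = total;
        self.segments.push(seg);
        Ok(true)
    }
}
```

In this sketch a single vector file is just a bunch whose segments list has length 1, which matches the "one code path" point above.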

Out of scope

Write-side changes; blob fallback-placeholder optimization; predicate/vector pushdown into vector files; upstream's rowIdPushDown branch (non-pushdown semantics only).

Tests

Unit: VectorBunch::add guards (gap / overlap-lower-seq / same-first_row_id dedup / row-count overflow / key-identity mismatch); normalize_merge_group multi-segment sort and removal of the vector range check; normalize_vector_write_cols sort-by-position and missing/unknown-column errors; build_source_plan aggregation (incl. differently-ordered write cols collapsing to one bunch) and cross-bunch duplicate-field-id rejection.
Integration (end-to-end): normal data.parquet + 3 rolled .vector.parquet segments reassemble into the correct column in order (including a null row); row_ranges selecting rows across a segment boundary return the correct subset.
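
The cross-boundary case can be illustrated with a small sketch of how absolute row ranges might be clipped to each segment. The function name clip_to_segment, its signature, and the use of half-open std::ops::Range<u64> values are assumptions for this example; the actual to_local_row_ranges helper may represent ranges differently.

```rust
use std::ops::Range;

/// Clip absolute row ranges to one segment and rebase them to
/// segment-local offsets. Ranges are half-open: [start, end).
fn clip_to_segment(absolute: &[Range<u64>], first_row_id: u64, row_count: u64) -> Vec<Range<u64>> {
    let segment_end = first_row_id + row_count;
    absolute
        .iter()
        .filter_map(|range| {
            let start = range.start.max(first_row_id);
            let end = range.end.min(segment_end);
            // Keep only the non-empty intersection with this segment.
            (start < end).then(|| (start - first_row_id)..(end - first_row_id))
        })
        .collect()
}
```

Under these assumptions, with three segments starting at rows 0, 4, and 8, the absolute range 3..6 maps to local rows 3..4 in the first segment, 0..2 in the second, and nothing in the third.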

All 41 data_evolution_reader tests pass; clippy clean.

Note: this builds on the earlier dedicated-vector classification/routing commits; the diff against main includes those as they have not yet landed upstream.

Route dedicated `*.vector.<format>` files to a vector column source in the data-evolution read path, so vector columns stored in their own sidecar files are materialized correctly alongside the normal data file.

- add `is_vector_store_file_name` detector (`.vector.` segment, case-insensitive)
- classify vector files in `normalize_merge_group` (normal -> vector -> blob), exclude them from the merge anchor, and force the column-merge path so a lone vector file is never raw-converted
- route vector-typed read fields to the dedicated vector file, falling back to an inline normal-file provider for PR2 compatibility
- cover dedicated `.vector.parquet`/`.vector.vortex` reads end-to-end

Mirrors upstream Paimon's dedicated vector-store read path.
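
A file-name detector and the normal -> vector -> blob grouping described in this commit could look roughly like the sketch below. The names is_vector_store_file_name_sketch, FileKind, classify, and order_merge_group are hypothetical, and treating a `.blob` suffix as the blob marker is an assumption of this example rather than something stated in the commit.

```rust
/// Variant order doubles as the sort order: normal, then vector, then blob.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum FileKind {
    Normal,
    Vector,
    Blob,
}

/// True if the name contains a `.vector.` segment, ignoring ASCII case.
fn is_vector_store_file_name_sketch(file_name: &str) -> bool {
    file_name.to_ascii_lowercase().contains(".vector.")
}

fn classify(file_name: &str) -> FileKind {
    if is_vector_store_file_name_sketch(file_name) {
        FileKind::Vector
    } else if file_name.to_ascii_lowercase().ends_with(".blob") {
        // Assumption: blob files are recognized by a `.blob` suffix.
        FileKind::Blob
    } else {
        FileKind::Normal
    }
}

/// Stable sort, so files of the same kind keep their relative order.
fn order_merge_group(names: &mut [String]) {
    names.sort_by_key(|name| classify(name));
}
```

Because vector files form their own kind here, a caller can pick the merge anchor from the Normal group only, which is the exclusion the commit describes.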

When a vector column is rolled into multiple `*.vector.<format>` segments (upstream `DedicatedFormatRollingFileWriter` uses an independent `VECTOR_TARGET_FILE_SIZE`), the read path must reassemble them. This mirrors upstream Paimon's `VectorFileBunch` non-pushdown semantics.

- add `VectorBunch` (modeled on `BlobBunch`): aggregates segments of one logical vector source, keyed by (schema_id, format suffix, normalized write cols), with continuity/dedup guards in `add` (same-first_row_id lower-seq ignore, overlap lower-seq ignore, gap error, row-count overflow error, key-identity mismatch error)
- sort vector segments by first_row_id asc / max_sequence_number desc in `normalize_merge_group`, and stop validating rolled segments (slices) against the normal-file row range
- compute per-file `normalized_write_cols` in `load_file_infos` (write cols sorted by field position; missing write_cols -> DataInvalid), so differently-ordered raw write cols that normalize equal aggregate into one bunch
- aggregate same-key segments into one `FieldSource::VectorBunch` in `build_source_plan`, with a bunch-granularity duplicate-field-id guard and a bunch row-count check; a single vector file is simply a bunch of length 1
- read the bunch via the same sequential-concat path as the blob bunch; each segment carries its real first_row_id/row_count, so `to_local_row_ranges` clips absolute row ranges per segment
- cover rolled reassembly and cross-boundary row_ranges end-to-end

Out of scope: write side, blob fallback-placeholder, predicate/vector pushdown, upstream rowIdPushDown branch.
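
The segment ordering and the write-column normalization from this commit can be sketched as below. SegmentMeta, sort_vector_segments, and normalize_write_cols_sketch are hypothetical names; the sketch models the schema as a plain list of field names and reports errors as String, whereas the commit reports DataInvalid.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct SegmentMeta {
    file_name: String,
    first_row_id: u64,
    max_sequence_number: u64,
}

/// Order segments by first_row_id ascending, then max_sequence_number descending,
/// so the newest segment for a given first_row_id comes first.
fn sort_vector_segments(segments: &mut [SegmentMeta]) {
    segments.sort_by(|a, b| {
        a.first_row_id
            .cmp(&b.first_row_id)
            .then_with(|| b.max_sequence_number.cmp(&a.max_sequence_number))
    });
}

/// Sort write columns by their position in the schema. A missing write_cols
/// list or a column that is not in the schema is treated as an error.
fn normalize_write_cols_sketch(
    write_cols: Option<&[&str]>,
    schema_field_names: &[&str],
) -> Result<Vec<String>, String> {
    let cols = write_cols.ok_or_else(|| "vector file is missing write_cols".to_string())?;
    let mut positioned = Vec::with_capacity(cols.len());
    for col in cols {
        let pos = schema_field_names
            .iter()
            .position(|name| name == col)
            .ok_or_else(|| format!("unknown write column: {col}"))?;
        positioned.push((pos, col.to_string()));
    }
    positioned.sort_by_key(|(pos, _)| *pos);
    Ok(positioned.into_iter().map(|(_, name)| name).collect())
}
```

With this normalization, two files whose raw write columns are listed in different orders yield the same normalized list, so they would share one bunch key.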

JingsongLi

+1

JunRuiLee marked this pull request as draft June 30, 2026 07:55

JunRuiLee force-pushed the feat/vector-type-pr3 branch from 7ad8dae to cb04deb Compare June 30, 2026 08:02

JunRuiLee added 2 commits June 30, 2026 16:07

JunRuiLee force-pushed the feat/vector-type-pr3 branch from cb04deb to dede550 Compare June 30, 2026 08:09

JunRuiLee marked this pull request as ready for review June 30, 2026 08:11

JingsongLi approved these changes Jun 30, 2026

JingsongLi merged commit b4377f3 into apache:main Jun 30, 2026
8 checks passed
