
[table] Support reading dedicated and rolling (multi-segment) vector files #423

Merged
JingsongLi merged 2 commits into apache:main from JunRuiLee:feat/vector-type-pr3
Jun 30, 2026
@JunRuiLee JunRuiLee commented Jun 30, 2026


Purpose

Add read-path support for dedicated .vector.<format> files in data-evolution splits, including rolling (multi-segment) vector columns. Upstream Paimon's DedicatedFormatRollingFileWriter rolls a vector column into multiple .vector. segments (using an independent VECTOR_TARGET_FILE_SIZE), so the Rust reader needs to consume such tables by reassembling the segments. This mirrors upstream's VectorFileBunch non-pushdown semantics in DataEvolutionSplitRead.java.

Changes

All changes are confined to the read path (crates/paimon/src/table/data_evolution_reader.rs).

  • Classify & route vector files: is_vector_store_file_name detector; normalize_merge_group classifies .vector. files into their own group, excludes them from the merge anchor, and forces the column-merge path.
  • VectorBunch (modeled on the existing BlobBunch): aggregates rolled segments belonging to one logical vector source, keyed by (schema_id, format_suffix, normalized_write_cols). add enforces continuity/dedup (same-first_row_id lower-seq -> ignore, overlap lower-seq -> ignore, gap -> error, row-count overflow -> error, key-identity mismatch -> error).
  • Segment ordering: normalize_merge_group sorts vector segments by first_row_id asc, then max_sequence_number desc, and no longer validates rolled segments (which are slices) against the normal-file row range.
  • normalized_write_cols: computed per vector file in load_file_infos (write columns sorted by field position; missing write_cols -> DataInvalid), so differently-ordered raw write columns that normalize equal aggregate into one bunch.
  • build_source_plan: aggregates same-key segments into a single FieldSource::VectorBunch, with a bunch-granularity duplicate-field-id guard and a bunch row-count check symmetric to the blob check. A single vector file is simply a bunch of length 1 -- one code path, no special-casing.
  • Read: VectorBunch is read via the same sequential-concat path as BlobBunch; each segment carries its real first_row_id/row_count, so the existing to_local_row_ranges mechanism clips absolute row ranges per segment correctly.
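
The continuity/dedup rules listed for VectorBunch::add can be sketched roughly as below. This is a minimal illustration, not the code from data_evolution_reader.rs: the BunchKey and Segment structs, the String error type, and the expected_row_count field are hypothetical stand-ins, and rejecting an overlapping segment whose sequence number is not lower is an assumption borrowed from the blob-style check rather than something this description states.

```rust
/// Identity shared by every segment of one logical vector source.
#[derive(Debug, Clone, PartialEq, Eq)]
struct BunchKey {
    schema_id: i64,
    format_suffix: String,
    normalized_write_cols: Vec<String>,
}

/// Hypothetical, trimmed-down view of one rolled `.vector.<format>` file.
#[derive(Debug, Clone)]
struct Segment {
    key: BunchKey,
    first_row_id: u64,
    row_count: u64,
    max_sequence_number: u64,
}

#[derive(Debug)]
struct VectorBunch {
    key: BunchKey,
    expected_row_count: u64,
    row_count: u64,
    segments: Vec<Segment>,
}

impl VectorBunch {
    fn new(key: BunchKey, expected_row_count: u64) -> Self {
        Self { key, expected_row_count, row_count: 0, segments: Vec::new() }
    }

    /// Expects segments sorted by first_row_id asc, then max_sequence_number desc.
    fn add(&mut self, seg: Segment) -> Result<(), String> {
        if seg.key != self.key {
            return Err("key-identity mismatch".to_string());
        }
        if let Some(last) = self.segments.last() {
            let expected_next = last.first_row_id + last.row_count;
            if seg.first_row_id < expected_next {
                // Same first_row_id or overlapping rows: an older
                // (lower-sequence) segment is silently ignored.
                if seg.max_sequence_number < last.max_sequence_number {
                    return Ok(());
                }
                // Assumption: anything else that overlaps is treated as invalid.
                return Err(format!(
                    "overlap at row id {} without a lower sequence number",
                    seg.first_row_id
                ));
            }
            if seg.first_row_id > expected_next {
                return Err(format!(
                    "gap: expected row id {}, got {}",
                    expected_next, seg.first_row_id
                ));
            }
        }
        let total = self.row_count + seg.row_count;
        if total > self.expected_row_count {
            return Err(format!(
                "row-count overflow: {} > {}",
                total, self.expected_row_count
            ));
        }
        self.row_count = total;
        self.segments.push(seg);
        Ok(())
    }
}
```

Under this sketch a single vector file is just a bunch holding one segment, which matches the "one code path" point above.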

Out of scope

Write-side changes; blob fallback-placeholder optimization; predicate/vector pushdown into vector files; upstream's rowIdPushDown branch (non-pushdown semantics only).

Tests

  • Unit: VectorBunch::add guards (gap / overlap-lower-seq / same-first_row_id dedup / row-count overflow / key-identity mismatch); normalize_merge_group multi-segment sort and removal of the vector range check; normalize_vector_write_cols sort-by-position and missing/unknown-column errors; build_source_plan aggregation (incl. differently-ordered write cols collapsing to one bunch) and cross-bunch duplicate-field-id rejection.
  • Integration (end-to-end): normal data.parquet + 3 rolled .vector.parquet segments reassemble into the correct column in order (including a null row); row_ranges selecting rows across a segment boundary return the correct subset.
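
To illustrate the cross-boundary case, here is a small sketch of how an absolute row range could be clipped to each segment's local coordinates. The function name clip_to_segment and the half-open std::ops::Range<u64> representation are assumptions made for this example; the actual to_local_row_ranges signature and range type may differ.

```rust
use std::ops::Range;

/// Clips absolute, half-open row ranges to one segment and rebases them so
/// that row 0 is the segment's first row. Ranges that miss the segment
/// entirely are dropped.
fn clip_to_segment(
    first_row_id: u64,
    row_count: u64,
    ranges: &[Range<u64>],
) -> Vec<Range<u64>> {
    let segment_end = first_row_id + row_count;
    ranges
        .iter()
        .filter_map(|r| {
            let start = r.start.max(first_row_id);
            let end = r.end.min(segment_end);
            // Keep only non-empty intersections.
            (start < end).then(|| (start - first_row_id)..(end - first_row_id))
        })
        .collect()
}
```

With three rolled segments covering rows 0..4, 4..8 and 8..10, a request for absolute rows 3..6 selects the last row of the first segment and the first two rows of the second, and nothing from the third.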

All 41 data_evolution_reader tests pass; clippy clean.

Note: this builds on the earlier dedicated-vector classification/routing commits; the diff against main includes those as they have not yet landed upstream.

@JunRuiLee JunRuiLee marked this pull request as draft June 30, 2026 07:55
@JunRuiLee JunRuiLee force-pushed the feat/vector-type-pr3 branch from 7ad8dae to cb04deb Compare June 30, 2026 08:02
Route dedicated `*.vector.<format>` files to a vector column source in the
data-evolution read path, so vector columns stored in their own sidecar files
are materialized correctly alongside the normal data file.

- add `is_vector_store_file_name` detector (`.vector.` segment, case-insensitive)
- classify vector files in `normalize_merge_group` (normal -> vector -> blob),
  exclude them from the merge anchor, and force the column-merge path so a lone
  vector file is never raw-converted
- route vector-typed read fields to the dedicated vector file, falling back to
  an inline normal-file provider for PR2 compatibility
- cover dedicated `.vector.parquet`/`.vector.vortex` reads end-to-end
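
A minimal sketch of the detector described in the first bullet, assuming it is nothing more than a case-insensitive substring check. The companion helper vector_format_suffix is hypothetical and only shows where a format suffix could come from.

```rust
/// True when the file name contains a `.vector.` segment, ignoring ASCII case.
fn is_vector_store_file_name(file_name: &str) -> bool {
    file_name.to_ascii_lowercase().contains(".vector.")
}

/// Hypothetical helper: the lowercased text after the last `.vector.` marker,
/// or None when the marker is absent or nothing follows it.
fn vector_format_suffix(file_name: &str) -> Option<String> {
    const MARKER: &str = ".vector.";
    let lower = file_name.to_ascii_lowercase();
    let idx = lower.rfind(MARKER)?;
    let suffix = &lower[idx + MARKER.len()..];
    if suffix.is_empty() {
        None
    } else {
        Some(suffix.to_string())
    }
}
```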

Mirrors upstream Paimon's dedicated vector-store read path.

When a vector column is rolled into multiple `*.vector.<format>` segments
(upstream `DedicatedFormatRollingFileWriter` uses an independent
`VECTOR_TARGET_FILE_SIZE`), the read path must reassemble them. This mirrors
upstream Paimon's `VectorFileBunch` non-pushdown semantics.

- add `VectorBunch` (modeled on `BlobBunch`): aggregates segments of one logical
  vector source, keyed by (schema_id, format suffix, normalized write cols), with
  continuity/dedup guards in `add` (same-first_row_id lower-seq ignore, overlap
  lower-seq ignore, gap error, row-count overflow error, key-identity mismatch error)
- sort vector segments by first_row_id asc / max_sequence_number desc in
  `normalize_merge_group`, and stop validating rolled segments (slices) against the
  normal-file row range
- compute per-file `normalized_write_cols` in `load_file_infos` (write cols sorted by
  field position; missing write_cols -> DataInvalid), so differently-ordered raw write
  cols that normalize equal aggregate into one bunch
- aggregate same-key segments into one `FieldSource::VectorBunch` in
  `build_source_plan`, with a bunch-granularity duplicate-field-id guard and a bunch
  row-count check; a single vector file is simply a bunch of length 1
- read the bunch via the same sequential-concat path as the blob bunch; each segment
  carries its real first_row_id/row_count, so `to_local_row_ranges` clips absolute row
  ranges per segment
- cover rolled reassembly and cross-boundary row_ranges end-to-end
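
The segment ordering and write-column normalization above can be sketched as follows. The SegmentMeta struct, the &str-based schema representation, and the String error type are assumptions for illustration only; the real code works on file metadata and reports DataInvalid.

```rust
use std::cmp::Reverse;

/// Hypothetical, trimmed-down segment metadata used only for ordering.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SegmentMeta {
    file_name: String,
    first_row_id: u64,
    max_sequence_number: u64,
}

/// Orders segments by first_row_id ascending, then max_sequence_number
/// descending, so the newest copy of a row range comes first.
fn sort_vector_segments(segments: &mut [SegmentMeta]) {
    segments.sort_by_key(|s| (s.first_row_id, Reverse(s.max_sequence_number)));
}

/// Sorts write columns by their position in the schema. Missing write_cols
/// or a column unknown to the schema is reported as an error.
fn normalize_vector_write_cols(
    write_cols: Option<&[&str]>,
    schema_fields: &[&str],
) -> Result<Vec<String>, String> {
    let cols = write_cols.ok_or_else(|| "vector file is missing write_cols".to_string())?;
    let mut positioned = Vec::with_capacity(cols.len());
    for col in cols {
        let pos = schema_fields
            .iter()
            .position(|field| field == col)
            .ok_or_else(|| format!("unknown write column: {}", col))?;
        positioned.push((pos, col.to_string()));
    }
    positioned.sort_by_key(|(pos, _)| *pos);
    Ok(positioned.into_iter().map(|(_, name)| name).collect())
}
```

Because both ["b", "a"] and ["a", "b"] normalize to the same list, files that wrote the same columns in a different order end up with the same bunch key.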

Out of scope: write side, blob fallback-placeholder, predicate/vector pushdown,
upstream rowIdPushDown branch.
@JunRuiLee JunRuiLee force-pushed the feat/vector-type-pr3 branch from cb04deb to dede550 Compare June 30, 2026 08:09
@JunRuiLee JunRuiLee marked this pull request as ready for review June 30, 2026 08:11

@JingsongLi JingsongLi left a comment


+1

@JingsongLi JingsongLi merged commit b4377f3 into apache:main Jun 30, 2026
8 checks passed