[table] Support reading dedicated and rolling (multi-segment) vector files #423
Merged
Route dedicated `*.vector.<format>` files to a vector column source in the data-evolution read path, so vector columns stored in their own sidecar files are materialized correctly alongside the normal data file.

- add `is_vector_store_file_name` detector (`.vector.` segment, case-insensitive)
- classify vector files in `normalize_merge_group` (normal -> vector -> blob), exclude them from the merge anchor, and force the column-merge path so a lone vector file is never raw-converted
- route vector-typed read fields to the dedicated vector file, falling back to an inline normal-file provider for PR2 compatibility
- cover dedicated `.vector.parquet`/`.vector.vortex` reads end-to-end

Mirrors upstream Paimon's dedicated vector-store read path.
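For illustration, the detector described in this commit could look roughly like the sketch below. Only the name `is_vector_store_file_name` and the "`.vector.` segment, case-insensitive" rule come from the commit message; the signature, the handling of directory components, and the choice of ASCII lowercasing are assumptions, not the code in the PR.

```rust
/// Hypothetical sketch of the detector; the real signature is not shown in this PR text.
/// Returns true when the final path component contains a `.vector.` segment,
/// compared case-insensitively (ASCII only, which is an assumption).
fn is_vector_store_file_name(file_name: &str) -> bool {
    // Assumption: only the last path component matters, so a directory
    // named like `foo.vector.bar/` does not trigger a match.
    let base = file_name.rsplit('/').next().unwrap_or(file_name);
    base.to_ascii_lowercase().contains(".vector.")
}
```

Under these assumptions, `data-0.vector.parquet` and `DATA-0.VECTOR.Vortex` are detected, while `data-0.parquet` is not.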
When a vector column is rolled into multiple `*.vector.<format>` segments (upstream `DedicatedFormatRollingFileWriter` uses an independent `VECTOR_TARGET_FILE_SIZE`), the read path must reassemble them. This mirrors upstream Paimon's `VectorFileBunch` non-pushdown semantics.

- add `VectorBunch` (modeled on `BlobBunch`): aggregates segments of one logical vector source, keyed by (schema_id, format suffix, normalized write cols), with continuity/dedup guards in `add` (same-first_row_id lower-seq ignore, overlap lower-seq ignore, gap error, row-count overflow error, key-identity mismatch error)
- sort vector segments by first_row_id asc / max_sequence_number desc in `normalize_merge_group`, and stop validating rolled segments (slices) against the normal-file row range
- compute per-file `normalized_write_cols` in `load_file_infos` (write cols sorted by field position; missing write_cols -> DataInvalid), so differently-ordered raw write cols that normalize equal aggregate into one bunch
- aggregate same-key segments into one `FieldSource::VectorBunch` in `build_source_plan`, with a bunch-granularity duplicate-field-id guard and a bunch row-count check; a single vector file is simply a bunch of length 1
- read the bunch via the same sequential-concat path as the blob bunch; each segment carries its real first_row_id/row_count, so `to_local_row_ranges` clips absolute row ranges per segment
- cover rolled reassembly and cross-boundary row_ranges end-to-end

Out of scope: write side, blob fallback-placeholder, predicate/vector pushdown, upstream rowIdPushDown branch.
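The guards listed for `VectorBunch::add` can be sketched as follows. This is a simplified model, not the PR's implementation: the `BunchKey` and `Segment` types, the `String` error type, and the field layout are all hypothetical, and treating a same-or-overlapping segment whose sequence number is not lower as an error is an assumption the commit message does not state. Segments are assumed to arrive already sorted by `first_row_id` ascending, then `max_sequence_number` descending.

```rust
// Hypothetical types for illustration only; the real structs are not shown in this PR text.
#[derive(Debug, Clone, PartialEq, Eq)]
struct BunchKey {
    schema_id: i64,
    format_suffix: String,
    normalized_write_cols: Vec<String>,
}

#[derive(Debug, Clone)]
struct Segment {
    key: BunchKey,
    first_row_id: u64,
    row_count: u64,
    max_sequence_number: i64,
}

#[derive(Debug)]
struct VectorBunch {
    expected_row_count: u64,
    segments: Vec<Segment>,
    row_count: u64,
}

impl VectorBunch {
    fn new(expected_row_count: u64) -> Self {
        Self { expected_row_count, segments: Vec::new(), row_count: 0 }
    }

    /// Adds a segment, applying the continuity/dedup guards in order.
    fn add(&mut self, seg: Segment) -> Result<(), String> {
        if let Some(last) = self.segments.last() {
            let expected_next = last.first_row_id + last.row_count;
            if seg.first_row_id < expected_next {
                // Same first_row_id or overlapping rows: a lower-sequence
                // segment is superseded by the one already kept, so ignore it.
                if seg.max_sequence_number < last.max_sequence_number {
                    return Ok(());
                }
                // Assumption: anything else in this position is invalid input.
                return Err(format!(
                    "overlapping segment at row {} without a lower sequence number",
                    seg.first_row_id
                ));
            }
            if seg.first_row_id > expected_next {
                return Err(format!(
                    "gap: expected first_row_id {}, got {}",
                    expected_next, seg.first_row_id
                ));
            }
            if seg.key != last.key {
                return Err("key-identity mismatch within one bunch".to_string());
            }
        }
        let new_count = self
            .row_count
            .checked_add(seg.row_count)
            .ok_or_else(|| "row count arithmetic overflow".to_string())?;
        if new_count > self.expected_row_count {
            return Err(format!(
                "row-count overflow: {} > {}",
                new_count, self.expected_row_count
            ));
        }
        self.row_count = new_count;
        self.segments.push(seg);
        Ok(())
    }
}
```

In this model a single vector file is just a bunch holding one segment, which matches the "bunch of length 1" remark in the commit message.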
Purpose
Add read-path support for dedicated `.vector.<format>` files in data-evolution splits, including rolling (multi-segment) vector columns. Upstream Paimon's `DedicatedFormatRollingFileWriter` rolls a vector column into multiple `.vector.` segments (using an independent `VECTOR_TARGET_FILE_SIZE`), so the Rust reader needs to consume such tables by reassembling the segments. This mirrors upstream's `VectorFileBunch` non-pushdown semantics in `DataEvolutionSplitRead.java`.

Changes
All changes are confined to the read path (`crates/paimon/src/table/data_evolution_reader.rs`).

- `is_vector_store_file_name` detector; `normalize_merge_group` classifies `.vector.` files into their own group, excludes them from the merge anchor, and forces the column-merge path.
- `VectorBunch` (modeled on the existing `BlobBunch`): aggregates rolled segments belonging to one logical vector source, keyed by `(schema_id, format_suffix, normalized_write_cols)`. `add` enforces continuity/dedup (same-`first_row_id` lower-seq -> ignore, overlap lower-seq -> ignore, gap -> error, row-count overflow -> error, key-identity mismatch -> error).
- `normalize_merge_group` sorts vector segments by `first_row_id` asc, then `max_sequence_number` desc, and no longer validates rolled segments (which are slices) against the normal-file row range.
- `normalized_write_cols`: computed per vector file in `load_file_infos` (write columns sorted by field position; missing `write_cols` -> `DataInvalid`), so differently-ordered raw write columns that normalize equal aggregate into one bunch.
- `build_source_plan`: aggregates same-key segments into a single `FieldSource::VectorBunch`, with a bunch-granularity duplicate-field-id guard and a bunch row-count check symmetric to the blob check. A single vector file is simply a bunch of length 1 -- one code path, no special-casing.
- `VectorBunch` is read via the same sequential-concat path as `BlobBunch`; each segment carries its real `first_row_id`/`row_count`, so the existing `to_local_row_ranges` mechanism clips absolute row ranges per segment correctly.

Out of scope
Write-side changes; blob fallback-placeholder optimization; predicate/vector pushdown into vector files; upstream's `rowIdPushDown` branch (non-pushdown semantics only).

Tests
- `VectorBunch::add` guards (gap / overlap-lower-seq / same-first_row_id dedup / row-count overflow / key-identity mismatch); `normalize_merge_group` multi-segment sort and removal of the vector range check; `normalize_vector_write_cols` sort-by-position and missing/unknown-column errors; `build_source_plan` aggregation (incl. differently-ordered write cols collapsing to one bunch) and cross-bunch duplicate-field-id rejection.
- `data.parquet` + 3 rolled `.vector.parquet` segments reassemble into the correct column in order (including a null row); `row_ranges` selecting rows across a segment boundary return the correct subset.

All 41 `data_evolution_reader` tests pass; clippy clean.
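To make the cross-boundary `row_ranges` case concrete, the per-segment clipping can be sketched as below. The real `to_local_row_ranges` signature is not shown in this PR, so the parameters here are hypothetical, and the use of half-open absolute ranges is an assumption.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the existing `to_local_row_ranges` helper.
/// Clips absolute, half-open row ranges to one segment and rebases them so
/// that row 0 is the segment's first row. Ranges that miss the segment are dropped.
fn to_local_row_ranges(
    ranges: &[Range<u64>],
    first_row_id: u64,
    row_count: u64,
) -> Vec<Range<u64>> {
    let seg_end = first_row_id + row_count;
    ranges
        .iter()
        .filter_map(|r| {
            // Intersect the requested range with [first_row_id, seg_end).
            let start = r.start.max(first_row_id);
            let end = r.end.min(seg_end);
            if start < end {
                Some((start - first_row_id)..(end - first_row_id))
            } else {
                None
            }
        })
        .collect()
}
```

With three rolled segments covering absolute rows 0..4, 4..7 and 7..10, a request for rows 3..8 maps to local ranges 3..4, 0..3 and 0..1 respectively. Concatenating the per-segment results in order yields the requested subset.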