
[WIP] feat(manifest): support snapshot manifest cache #386

Draft
gripleaf wants to merge 1 commit into alibaba:main from gripleaf:feat/snapshot-manifest-cache

gripleaf (Contributor) commented Jun 29, 2026

Purpose

In our production workload, a Paimon table may have about 60k buckets. A batch of data is imported roughly every 15 minutes, and the interval may become longer. During query planning, one scan read and decoded about 4.89 million manifest entries with 16 threads, which took around 30 seconds, while only a small subset of entries was finally kept after pruning.

This patch introduces a snapshot-level live manifest cache to reduce the cost of repeatedly reading and decoding manifests. The cache stores the merged live manifest entries for snapshots. When scanning a newer snapshot, paimon-cpp looks up the latest cached snapshot that is not greater than the target snapshot. If it is the same snapshot, the scan uses the cached entries directly; otherwise paimon-cpp incrementally builds the target snapshot by reading the intermediate delta manifests.
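
The lookup and incremental build described above can be sketched roughly as follows. This is only an illustration, not the code in this patch: `SnapshotId`, `LiveEntries`, `Delta`, `FindBase`, and `BuildLive` are hypothetical names, live entries are reduced to file names, and snapshot ids are assumed to start at 1. On a cache miss the sketch simply replays every delta from the beginning, whereas the real scan would presumably fall back to the normal manifest read path.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins; the real paimon-cpp types carry much more state.
using SnapshotId = int64_t;
using LiveEntries = std::set<std::string>;  // live data file names

// Changes introduced by a single snapshot (its delta manifests).
struct Delta {
    std::vector<std::string> added;
    std::vector<std::string> deleted;
};

// Return the greatest cached snapshot id that is <= target, if any.
std::optional<SnapshotId> FindBase(const std::map<SnapshotId, LiveEntries>& cached,
                                   SnapshotId target) {
    auto it = cached.upper_bound(target);  // first cached id strictly greater than target
    if (it == cached.begin()) {
        return std::nullopt;  // every cached snapshot is newer than target
    }
    return std::prev(it)->first;
}

// Build the live entries of `target` by replaying the deltas in (base, target].
LiveEntries BuildLive(const std::map<SnapshotId, LiveEntries>& cached,
                      const std::map<SnapshotId, Delta>& deltas, SnapshotId target) {
    LiveEntries live;
    SnapshotId from = 0;  // assumption: ids start at 1, so 0 means "no base"
    if (auto base = FindBase(cached, target)) {
        live = cached.at(*base);  // exact hit when *base == target: the loop below is a no-op
        from = *base;
    }
    for (auto it = deltas.upper_bound(from); it != deltas.end() && it->first <= target; ++it) {
        for (const auto& file : it->second.added) live.insert(file);
        for (const auto& file : it->second.deleted) live.erase(file);
    }
    return live;
}
```

With snapshot 5 cached and deltas available for snapshots 6 and 7, a scan of snapshot 7 would start from the cached entries of 5 and apply only those two deltas.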

The implementation reuses the existing byte-oriented Cache interface:

  • add CacheKind::SNAPSHOT_LIVE_MANIFEST;
  • use table_path#branch as the logical cache key;
  • store multiple snapshots in one serialized binary cache value, bounded by scan.manifest-entry-cache.max-snapshots;
  • enable the optimization when a cache is provided through ScanContextBuilder::WithCache() and scan.manifest-entry-cache.max-snapshots is greater than 0.
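
As a rough illustration of the points above, the key construction, the enable condition, and the snapshot bound might look like the sketch below. `MakeCacheKey`, `CacheEnabled`, and `SnapshotBundle` are hypothetical names, not identifiers from this patch, and dropping the oldest snapshot first is an assumed eviction policy that the description does not specify. The real bundle is a serialized binary value; an in-memory map stands in for it here.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using SnapshotId = int64_t;

// Logical cache key as described above: table_path#branch.
std::string MakeCacheKey(const std::string& table_path, const std::string& branch) {
    return table_path + "#" + branch;
}

// The cache path is taken only when a cache was supplied (for example through
// ScanContextBuilder::WithCache()) and scan.manifest-entry-cache.max-snapshots > 0.
bool CacheEnabled(bool has_cache, int32_t max_snapshots) {
    return has_cache && max_snapshots > 0;
}

// Hypothetical in-memory form of the multi-snapshot value stored under one key.
struct SnapshotBundle {
    std::map<SnapshotId, std::vector<std::string>> live_by_snapshot;

    // Insert one snapshot, then drop the oldest ids until the bound holds.
    // Assumption: oldest-first eviction; a non-positive bound empties the bundle.
    void Put(SnapshotId id, std::vector<std::string> live, int32_t max_snapshots) {
        live_by_snapshot[id] = std::move(live);
        const std::size_t bound =
            max_snapshots > 0 ? static_cast<std::size_t>(max_snapshots) : 0;
        while (live_by_snapshot.size() > bound) {
            live_by_snapshot.erase(live_by_snapshot.begin());
        }
    }
};
```

Keeping all snapshots of a table branch under a single key means one cache lookup per scan, at the price of deserializing the whole bundle even when only one snapshot is needed.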

This adds one in-memory serialize/deserialize step in exchange for avoiding remote manifest reads and manifest decoding. In the expected case, live entries are far fewer than all historical manifest entries. Even in a conservative case, the cache is useful as long as deserializing the cached live entries is cheaper than reading the manifest files from remote storage and decoding all required manifest entries again.

Tests

  • cmake --build build --target paimon-core-test -j2
  • ./build/release/paimon-core-test --gtest_filter='FileStoreScanTest.TestSnapshotLiveManifestCache:FileStoreScanTest.TestSnapshotLiveManifestCacheUsesCacheKey:TableScanTest.*:CoreOptionsTest.TestDefaultValue:CoreOptionsTest.TestFromMap:CoreOptionsTest.TestInvalidCase'
  • git -c filter.lfs.process= -c filter.lfs.clean=cat -c filter.lfs.required=false diff --check

API and Format

This change extends the public cache kind enum with CacheKind::SNAPSHOT_LIVE_MANIFEST.

It removes the separate scan.manifest-entry-cache.enabled option. The cache is controlled by scan.manifest-entry-cache.max-snapshots: values greater than 0 enable the cache path when a cache is provided through WithCache(), and 0 disables it.

It does not change table storage format, file format, or network protocol. The serialized snapshot live manifest bundle is an in-memory cache value only and can be rebuilt from manifest files if absent or evicted.

Documentation

Yes. Added/updated user guide documentation for snapshot live manifest cache under docs/source/user_guide/manifest_entry_cache.rst.

Generative AI tooling

Generated-by: Codex (GPT-5)

gripleaf force-pushed the feat/snapshot-manifest-cache branch 11 times, most recently from 9b7c737 to 8337c3b, on June 30, 2026 12:50
gripleaf force-pushed the feat/snapshot-manifest-cache branch from 8337c3b to 5280b8e on June 30, 2026 13:31