PARQUET-3459: Per column compression #3526
Does this duplicate #3396?
This pull request has been automatically marked as stale because it has had no activity for at least 2 months. If you are still working on this change or plan to move it forward, please leave a comment or push a new commit so we know to keep it open. Otherwise, this PR will be closed automatically in about one month. Thank you for your contribution to Apache Parquet!
4f31b0f to fa11bde
Thread a per-column compressor-provider function through the writer stack, replace the duplicate store constructors with a validating builder, and fix the copy constructor to preserve previously-dropped fields.
fa11bde to 5587e33
Hi @wgtmac, thanks for your reviews on this feature. I really appreciate the guidance.

@mengna-lin Thank you for your patience! I think it now makes sense to proceed with your PR. Let me take a look.
  }

  @Override
  public BytesCompressor getCompressor(CompressionCodecName codecName, int level) {
This path bypasses DirectCodecFactory#createCompressor(...). With a direct codec factory, per-column levels for ZSTD/SNAPPY can fall back to the heap/Hadoop compressor path; please add a direct-factory override or test.
(reviewed by Codex)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.slf4j.Logger;
These imports are out of Spotless import order; the same pattern appears in a few added test imports too, so spotless:check will likely fail.
(reviewed by Codex)
…rt order
Route the level-aware getCompressor(codec, level) path through an overridable createCompressorAtLevel so DirectCodecFactory returns its direct SNAPPY/ZSTD compressors (with the level honored for ZSTD) instead of falling back to the heap/Hadoop path. Add a direct-factory test for this. Also fix Spotless import ordering flagged in review.
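For readers following the thread, the hook pattern described in this commit message can be sketched roughly as below. This is a minimal illustration, not the PR's code: CompressorHookSketch, HeapFactory, DirectFactory, and the string-returning Compressor are hypothetical stand-ins for the real parquet-java classes. Only the method names getCompressor and createCompressorAtLevel come from the commit message, and making the entry point final is an assumption.

```java
// Hypothetical, simplified stand-ins; none of these are the real parquet-java types.
class CompressorHookSketch {
  enum Codec { UNCOMPRESSED, SNAPPY, ZSTD, GZIP }

  // Stand-in for a compressor; it only reports which path created it.
  interface Compressor {
    String describe();
  }

  // Models the base (heap) factory.
  static class HeapFactory {
    // Level-aware entry point: always routes through the overridable hook,
    // so a subclass cannot be bypassed. (final is an assumption of this sketch.)
    public final Compressor getCompressor(Codec codec, int level) {
      return createCompressorAtLevel(codec, level);
    }

    // Default hook: heap path for every codec.
    protected Compressor createCompressorAtLevel(Codec codec, int level) {
      return () -> "heap:" + codec + ":" + level;
    }
  }

  // Models the direct factory: overrides the hook for SNAPPY and ZSTD only.
  static class DirectFactory extends HeapFactory {
    @Override
    protected Compressor createCompressorAtLevel(Codec codec, int level) {
      switch (codec) {
        case SNAPPY:
          return () -> "direct:SNAPPY"; // Snappy takes no level
        case ZSTD:
          return () -> "direct:ZSTD:" + level; // level is honored
        default:
          return super.createCompressorAtLevel(codec, level);
      }
    }
  }

  public static void main(String[] args) {
    HeapFactory factory = new DirectFactory();
    System.out.println(factory.getCompressor(Codec.ZSTD, 9).describe()); // direct:ZSTD:9
    System.out.println(factory.getCompressor(Codec.GZIP, 6).describe()); // heap:GZIP:6
  }
}
```

Because the level-aware call goes through the hook rather than constructing a compressor itself, a direct-factory test only needs to assert which path produced the compressor for a given codec and level.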
@wgtmac I addressed those comments. Please take another look when you get a chance. Thanks.
Rationale for this change
The Parquet format supports per-column compression at the spec level, but parquet-java has always forced a single codec across all columns.
This PR exposes that existing capability.
What changes are included in this PR?
Are these changes tested?
Yes.
Unit tests cover ParquetProperties getters/copy behavior (TestParquetProperties),
CodecFactory level-aware caching and invalid level rejection (TestDirectCodecFactory),
and ColumnChunkPageWriteStore codec resolution and invalid level rejection (TestColumnChunkPageWriteStore).
Integration tests cover end-to-end data round-trips and footer metadata verification
through both the ParquetWriter builder API and ParquetOutputFormat (TestParquetWriter).
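As a rough illustration of what "level-aware caching and invalid level rejection" means, the sketch below caches one compressor per (codec, level) pair and rejects out-of-range levels before touching the cache. Everything here is hypothetical: LevelAwareCacheSketch and its nested types are not parquet-java classes, and the accepted ranges (1 to 22 for ZSTD, 1 to 9 for GZIP) are assumptions chosen for the example, not the ranges the PR enforces.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; not the real CodecFactory.
class LevelAwareCacheSketch {
  enum Codec { ZSTD, GZIP }

  // Cache key: the level is part of the key, so ZSTD@3 and ZSTD@9 never share an entry.
  static final class Key {
    final Codec codec;
    final int level;

    Key(Codec codec, int level) {
      this.codec = codec;
      this.level = level;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Key)) return false;
      Key other = (Key) o;
      return codec == other.codec && level == other.level;
    }

    @Override
    public int hashCode() {
      return Objects.hash(codec, level);
    }
  }

  // Stand-in for a compressor instance.
  static final class Compressor {
    final Codec codec;
    final int level;

    Compressor(Codec codec, int level) {
      this.codec = codec;
      this.level = level;
    }
  }

  private final Map<Key, Compressor> cache = new ConcurrentHashMap<>();

  Compressor getCompressor(Codec codec, int level) {
    validate(codec, level); // reject before anything is cached
    return cache.computeIfAbsent(new Key(codec, level), k -> new Compressor(k.codec, k.level));
  }

  // Assumed ranges for illustration only.
  static void validate(Codec codec, int level) {
    int max = (codec == Codec.ZSTD) ? 22 : 9;
    if (level < 1 || level > max) {
      throw new IllegalArgumentException(
          "Invalid level " + level + " for " + codec + ", expected 1.." + max);
    }
  }

  public static void main(String[] args) {
    LevelAwareCacheSketch factory = new LevelAwareCacheSketch();
    Compressor a = factory.getCompressor(Codec.ZSTD, 3);
    Compressor b = factory.getCompressor(Codec.ZSTD, 3);
    Compressor c = factory.getCompressor(Codec.ZSTD, 9);
    System.out.println(a == b); // true: same codec and level hit the cache
    System.out.println(a == c); // false: a different level gets its own entry
  }
}
```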
Also tested with a Spark job.
Result
Are there any user-facing changes?
Two new APIs, fully backwards compatible:
- ParquetWriter.Builder.withCompressionCodec(col, codec)
- ParquetWriter.Builder.withCompressionLevel(col, level)
(Also accessible at the lower level via ParquetProperties.Builder.)
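The intended semantics of the two builder methods can be sketched as a per-column override on top of a file-wide default. The block below is a self-contained model, not the real API: PerColumnPropsSketch, its Builder, codecFor, and levelFor are hypothetical names, identifying a column by a plain string path is an assumption, and so is the behavior that a column without an override falls back to the default codec.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical model of per-column compression settings; not parquet-java code.
class PerColumnPropsSketch {
  enum Codec { UNCOMPRESSED, SNAPPY, ZSTD, GZIP }

  private final Codec defaultCodec;
  private final Map<String, Codec> codecByColumn;
  private final Map<String, Integer> levelByColumn;

  private PerColumnPropsSketch(Builder b) {
    this.defaultCodec = b.defaultCodec;
    this.codecByColumn = new HashMap<>(b.codecByColumn); // defensive copies
    this.levelByColumn = new HashMap<>(b.levelByColumn);
  }

  // A column override wins; otherwise the file-wide default applies (assumed behavior).
  Codec codecFor(String column) {
    return codecByColumn.getOrDefault(column, defaultCodec);
  }

  // Empty means "no explicit level, let the codec pick its own default".
  OptionalInt levelFor(String column) {
    Integer level = levelByColumn.get(column);
    return level == null ? OptionalInt.empty() : OptionalInt.of(level);
  }

  static final class Builder {
    private Codec defaultCodec = Codec.UNCOMPRESSED;
    private final Map<String, Codec> codecByColumn = new HashMap<>();
    private final Map<String, Integer> levelByColumn = new HashMap<>();

    // File-wide default codec.
    Builder withCompressionCodec(Codec codec) {
      this.defaultCodec = codec;
      return this;
    }

    // Per-column codec override, mirroring withCompressionCodec(col, codec).
    Builder withCompressionCodec(String column, Codec codec) {
      codecByColumn.put(column, codec);
      return this;
    }

    // Per-column level override, mirroring withCompressionLevel(col, level).
    Builder withCompressionLevel(String column, int level) {
      levelByColumn.put(column, level);
      return this;
    }

    PerColumnPropsSketch build() {
      return new PerColumnPropsSketch(this);
    }
  }

  public static void main(String[] args) {
    PerColumnPropsSketch props = new Builder()
        .withCompressionCodec(Codec.SNAPPY)
        .withCompressionCodec("payload", Codec.ZSTD)
        .withCompressionLevel("payload", 9)
        .build();
    System.out.println(props.codecFor("payload")); // ZSTD
    System.out.println(props.codecFor("id")); // SNAPPY
    System.out.println(props.levelFor("id").isPresent()); // false
  }
}
```

Under this model, existing callers that never set a per-column override see exactly the old single-codec behavior, which is what makes the change backwards compatible.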