Releases: apache/iceberg
Apache Iceberg 1.5.2
The 1.5.2 release contains the same changes as the 1.5.1 release. The 1.5.1 release had issues with the Spark runtime artifacts; specifically, certain artifacts were built with the wrong Scala version. Upgrading to 1.5.2 is strongly recommended for any system running 1.5.1.
Apache Iceberg 1.5.1
What's Changed
- [1.5.x] API: Fix default FileIO#newInputFile ManifestFile, DataFile and DeleteFile implementations by @amogh-jahagirdar in #10114
- [1.5.x] Core: Mark 502 and 504 failures as retryable to the exponential retry strategy by @amogh-jahagirdar in #10113
- Core: Fix JDBC Catalog table commit when migrating from schema V0 to V1 (#101111) by @jbonofre in #10152
- Core: Fix namespace SQL statement using ESCAPE character that works with MySQL/PostgreSQL (#10167) by @jbonofre in #10169
- (1.5.x cherry-pick) Spark 3.5: Fix system function pushdown in CoW row-level commands by @amogh-jahagirdar in #10170
- (1.5.x Cherry-pick) Spark 3.4: Fix system function pushdown in CoW row-level commands (#10119) by @amogh-jahagirdar in #10171
Full Changelog: apache-iceberg-1.5.0...apache-iceberg-1.5.1
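The retry change in #10113 above can be pictured with a small sketch of exponential backoff keyed on HTTP status codes. This is an illustration of the general technique, not Iceberg's REST client code: the `send` callable, the delay constants, the jitter, and the inclusion of 429 and 503 are all assumptions. Only 502 and 504 come from the release note.

```python
import random

# Status codes treated as retryable in this sketch. 502 and 504 come from the
# release note above; 429 and 503 are assumptions added for illustration.
RETRYABLE_STATUS = {429, 502, 503, 504}


def backoff_delays(max_retries, base_ms=100, cap_ms=5000):
    """Return the un-jittered delay (ms) before each retry: base * 2^attempt, capped."""
    return [min(cap_ms, base_ms * (2 ** attempt)) for attempt in range(max_retries)]


def call_with_retry(send, max_retries=4, sleep=lambda ms: None):
    """Call send() until it returns a non-retryable status or retries run out.

    `send` is a hypothetical zero-argument callable returning an HTTP status code.
    `sleep` is injectable so the sketch can run without actually waiting.
    """
    status = send()
    for delay in backoff_delays(max_retries):
        if status not in RETRYABLE_STATUS:
            break
        # Add up to 10% random jitter so clients do not retry in lockstep.
        sleep(delay + random.uniform(0, delay * 0.1))
        status = send()
    return status
```

With `max_retries=4`, a request that keeps returning 502 is attempted five times in total (the initial call plus four retries) before the last status is returned to the caller.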
Apache Iceberg 1.5.0
Apache Iceberg 1.5.0 was released on March 11, 2024.
The 1.5.0 release adds a variety of new features and bug fixes.
- API
- Core
- Add view support for REST catalog (#7913)
- Add view support for JDBC catalog (#9487)
- Add catalog type for glue, jdbc, nessie (#9647)
- Support Avro file encryption with AES GCM streams (#9436)
- Add ApplyNameMapping for Avro (#9347)
- Add StandardEncryptionManager (#9277)
- Add REST catalog table session cache (#8920)
- Support view metadata compression (#8552)
- Track partition statistics in TableMetadata (#8502)
- Enable column statistics filtering after planning (#8803)
- Spark
- Remove support for Spark 3.2 (#9295)
- Support views via SQL for Spark 3.4 and 3.5 (#9423, #9421, #9343, #9513, #9582)
- Support executor cache locality (#9563)
- Added support for delete manifest rewrites (#9020)
- Support encrypted output files (#9435)
- Add Spark UI metrics from Iceberg scan metrics (#8717)
- Parallelize reading files in add_files procedure (#9274)
- Support file and partition delete granularity (#9384)
- Flink
- Parquet
- Kafka-Connect
- Spec
- Vendor Integrations
- AWS: Support setting description for Glue table (#9530)
- AWS: Update S3FileIO test to run when CLIENT_FACTORY is not set (#9541)
- AWS: Add S3 Access Grants Integration (#9385)
- AWS: Glue catalog strip trailing slash on DB URI (#8870)
- Azure: Add FileIO that supports ADLSv2 storage (#8303)
- Azure: Make ADLSFileIO implement DelegateFileIO (#8563)
- Nessie: Support views for NessieCatalog (#8909)
- Nessie: Strip trailing slash for warehouse location (#9415)
- Nessie: Infer default API version from URI (#9459)
- Dependencies
- Bump Nessie to 0.77.1
- Bump ORC to 1.9.2
- Bump Arrow to 15.0.0
- Bump AWS Java SDK to 2.24.5
- Bump Azure Java SDK to 1.2.20
- Bump Google cloud libraries to 26.28.0
Note:
- To enable view support for JDBC catalog, configure `jdbc.schema-version` to `V1` in catalog properties.
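As a rough illustration of the note above, the sketch below builds the Spark configuration entries for a JDBC catalog with `jdbc.schema-version` set to `V1`. The helper name `jdbc_catalog_conf`, the catalog name, the JDBC URI, and the warehouse path are hypothetical; the property keys follow the usual `spark.sql.catalog.<name>.<property>` convention, and the exact set of properties a deployment needs may differ.

```python
def jdbc_catalog_conf(catalog_name, jdbc_uri, warehouse):
    """Build Spark configuration entries for an Iceberg JDBC catalog using schema V1.

    All three arguments are placeholders supplied by the caller.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "jdbc",
        f"{prefix}.uri": jdbc_uri,
        f"{prefix}.warehouse": warehouse,
        # From the note above: the V1 schema is needed for view support.
        f"{prefix}.jdbc.schema-version": "V1",
    }


# Hypothetical values, shown only to illustrate the resulting keys.
conf = jdbc_catalog_conf(
    "demo",
    "jdbc:postgresql://localhost:5432/iceberg",
    "s3://example-bucket/warehouse",
)
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each entry can then be passed to Spark as a `--conf key=value` pair or set on the session builder.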
New Contributors
- @reswqa made their first contribution in #7745
- @maxdebayser made their first contribution in #7796
- @mderoy made their first contribution in #7801
- @cxzl25 made their first contribution in #7825
- @tilman151 made their first contribution in #7781
- @TaoZex made their first contribution in #7761
- @Rondiz made their first contribution in #7829
- @grobgl made their first contribution in #7645
- @guiyanakuang made their first contribution in #7839
- @littlecatjianjiao made their first contribution in #7908
- @DaVincii made their first contribution in #7874
- @mumuhhh made their first contribution in #7866
- @Ewan-Keith made their first contribution in #7917
- @nikam14 made their first contribution in #7093
- @hsiang-c made their first contribution in #7920
- @ktk1012 made their first contribution in #8026
- @joan38 made their first contribution in #8002
- @coded9 made their first contribution in #8058
- @rustyconover made their first contribution in #8074
- @mr-brobot made their first contribution in #8061
- @Neuw84 made their first contribution in #7988
- @lintingbin made their first contribution in #8111
- @mrcnc made their first contribution in #8193
- @s-akhtar-baig made their first contribution in #8205
- @MaxNevermind made their first contribution in #7694
- @bmaisonn made their first contribution in #8209
- @HonahX made their first contribution in #8215
- @onerishabh made their first contribution in #8214
- @kengtin made their first contribution in #7161
- @aless10 made their first contribution in #8286
- @advancedxy made their first contribution in #8320
- @dacort made their first contribution in #8341
- @gegef2009 made their first contribution in #8154
- @TjuAachen made their first contribution in #8401
- @baiyangtx made their first contribution in #8416
- @hiteshbedre made their first contribution in #8491
- @harshm-dev made their first contribution in #8385
- @wForget made their first contribution in #8445
- @andreacfm made their first contribution in #8528
- @Paddy0523 made their first contribution in #8547
- @rushilshah1 made their first contribution in #8589
- @lanemoseley made their first contribution in #8618
- @tlm365 made their first contribution in #8447
- @jbonofre made their first contribution in #8612
- @jayceslesar made their first contribution in #8558
- @MehulBatra made their first contribution in #8408
- @clettieri made their first contribution in #8192
- @nk1506 made their first contribution in #8640
- @johanhenriksson made their first contribution in #8751
- @ashutosh-roy made their first contribution in #8707
- @Priyansh121096 made their first contribution in #8748
- @PickBas made their first contribution in #8819
- @jongwooo made their first contribution in #8666
- @rice668 made their first contribution in #8873
- @geruh made their first contribution in #8914
- @bknbkn made their first contribution in #8868
- @wangtaohz made their ...
Apache Iceberg 1.4.3
What's Changed
- Core: Scan only live entries in partitions table (#8969) by @Fokko in #9197
- [1.4.x] Core: Fix missing files from transaction retries with conflicting manifest merges (#9230) by @nastra in #9337
- [1.4.x] JDBC Catalog: Fix namespaceExists check with special characters (#8340) by @ismailsimsek in #9291
- [1.4.x] Core: Expired Snapshot files in a transaction should be deleted by @bartash in #9223
- [1.4.x] Core: Fix missing delete files from transaction (#9354) by @nastra in #9356
Full Changelog: apache-iceberg-1.4.2...apache-iceberg-1.4.3
Apache Iceberg 1.4.2
What's Changed
- Core: Ignore split offsets array when split offset is past file length by @amogh-jahagirdar in #8938
Full Changelog: apache-iceberg-1.4.1...apache-iceberg-1.4.2
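The fix above amounts to a sanity check on recorded split offsets before they are used for planning. The following is a minimal sketch of that idea, not the Iceberg implementation: the function name is hypothetical, offsets are assumed to be sorted ascending, and treating an offset equal to the file length as invalid is an assumption.

```python
def usable_split_offsets(split_offsets, file_length):
    """Return split offsets only if they are consistent with the file length.

    Returns None (meaning "plan without recorded offsets") when the list is
    missing, empty, or its last offset is at or past the end of the file.
    """
    if not split_offsets:
        return None
    # Offsets are assumed sorted ascending, so checking the last one is enough.
    if split_offsets[-1] >= file_length:
        return None
    return list(split_offsets)
```

Falling back to `None` rather than raising keeps a file with corrupt offset metadata readable; the reader just loses the precomputed split points.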
Apache Iceberg 1.4.1
What's Changed
- Core: Do not use a lazy split offset list in manifests (#8834) by @nastra in #8845
- Core: Ignore split offsets when the last split offset is past the file length by @amogh-jahagirdar in #8861
- AWS: avoid static global credentials provider which doesn't play well with lifecycle management (#8677) by @nastra in #8843
- Flink: Reverting the default custom partitioner for bucket column (#8848) by @nastra in #8858
Full Changelog: apache-iceberg-1.4.0...apache-iceberg-1.4.1
Apache Iceberg 1.4.0
- API
- Core
- Use V2 format by default in new tables (#8381)
- Use `zstd` compression for Parquet by default in new tables (#8593)
- Add strict metadata cleanup mode and enable it by default (#8397) (#8599)
- Avoid generating huge manifests during commits (#6335)
- Add a writer for unordered position deletes (#7692)
- Optimize `DeleteFileIndex` (#8157)
- Optimize lookup in `DeleteFileIndex` without useful bounds (#8278)
- Optimize split offsets handling (#8336)
- Optimize computing user-facing state in data tasks (#8346)
- Don't persist useless file and position bounds for deletes (#8360)
- Don't persist counts for paths and positions in position delete files (#8590)
- Support setting system-level properties via environmental variables (#5659)
- Add JSON parser for `ContentFile` and `FileScanTask` (#6934)
- Add REST spec and request for commits to multiple tables (#7741)
- Add REST API for committing changes against multiple tables (#7569)
- Default to exponential retry strategy in REST client (#8366)
- Support registering tables with REST session catalog (#6512)
- Add last updated timestamp and snapshot ID to partitions metadata table (#7581)
- Add total data size to partitions metadata table (#7920)
- Extend `ResolvingFileIO` to support bulk operations (#7976)
- Key metadata in Avro format (#6450)
- Add AES GCM encryption stream (#3231)
- Fix a connection leak in streaming delete filters (#8132)
- Fix lazy snapshot loading history (#8470)
- Fix unicode handling in HTTPClient (#8046)
- Fix paths for unpartitioned specs in writers (#7685)
- Fix OOM caused by Avro decoder caching (#7791)
- Spark
- Added support for Spark 3.5
- Code for DELETE, UPDATE, and MERGE commands has moved to Spark, and all related extensions have been dropped from Iceberg.
- Support for WHEN NOT MATCHED BY SOURCE clause in MERGE.
- Column pruning in merge-on-read operations.
- Ability to request a bigger advisory partition size for the final write to produce well-sized output files without harming the job parallelism.
- Dropped support for Spark 3.1
- Deprecated support for Spark 3.2
- Support vectorized reads for merge-on-read operations in Spark 3.4 and 3.5 (#8466)
- Increase default advisory partition size for writes in Spark 3.5 (#8660)
- Support distributed planning in Spark 3.4 and 3.5 (#8123)
- Support pushing down system functions by V2 filters in Spark 3.4 and 3.5 (#7886)
- Support fanout position delta writers in Spark 3.4 and 3.5 (#7703)
- Use fanout writers for unsorted tables by default in Spark 3.5 (#8621)
- Support multiple shuffle partitions per file in compaction in Spark 3.4 and 3.5 (#7897)
- Output net changes across snapshots for carryover rows in CDC (#7326)
- Display read metrics on Spark SQL UI (#7447) (#8445)
- Adjust split size to benefit from cluster parallelism in Spark 3.4 and 3.5 (#7714)
- Add `fast_forward` procedure (#8081)
- Support filters when rewriting position deletes (#7582)
- Support setting current snapshot with ref (#8163)
- Make backup table name configurable during migration (#8227)
- Add write and SQL options to override compression config (#8313)
- Correct partition transform functions to match the spec (#8192)
- Enable extra commit properties with metadata delete (#7649)
- Flink
- Add possibility of ordering the splits based on the file sequence number (#7661)
- Fix serialization in `TableSink` with anonymous object (#7866)
- Switch to `FileScanTaskParser` for JSON serialization of `IcebergSourceSplit` (#7978)
- Custom partitioner for bucket partitions (#7161)
- Implement data statistics coordinator to aggregate data statistics from operator subtasks (#7360)
- Support alter table column (#7628)
- Parquet
- ORC
- Handle filters with transforms by assuming the filter matches (#8244)
- Vendor Integrations
- GCP: Fix single byte read in `GCSInputStream` (#8071)
- GCP: Add properties for OAuth2 and update library (#8073)
- GCP: Add prefix and bulk operations to `GCSFileIO` (#8168)
- GCP: Add bundle jar for GCP-related dependencies (#8231)
- GCP: Add range reads to `GCSInputStream` (#8301)
- AWS: Add bundle jar for AWS-related dependencies (#8261)
- AWS: support config storage class for `S3FileIO` (#8154)
- AWS: Add `FileIO` tracker/closer to Glue catalog (#8315)
- AWS: Update S3 signer spec to allow an optional string body in `S3SignRequest` (#8361)
- Azure: Add `FileIO` that supports ADLSv2 storage (#8303)
- Azure: Make `ADLSFileIO` implement `DelegateFileIO` (#8563)
- Nessie: Provide better commit message on table registration (#8385)
- Dependencies
- Bump Nessie to 0.71.0
- Bump ORC to 1.9.1
- Bump Arrow to 12.0.1
- Bump AWS Java SDK to 2.20.131
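The 1.4.0 Core notes above change two defaults for new tables: format version 2 (#8381) and `zstd` Parquet compression (#8593). A table that needs different settings can set them explicitly at creation time. The sketch below renders a `TBLPROPERTIES` clause for that purpose; the helper name is hypothetical, and the property keys `format-version` and `write.parquet.compression-codec` are the table properties these defaults are assumed to correspond to, so check them against the configuration documentation for the version in use.

```python
# Defaults for new tables as of 1.4.0, per the release notes above.
# The property keys are assumptions about which table properties carry them.
NEW_TABLE_DEFAULTS = {
    "format-version": "2",
    "write.parquet.compression-codec": "zstd",
}


def tblproperties_clause(overrides=None):
    """Render a TBLPROPERTIES clause, applying overrides on top of the 1.4.0 defaults."""
    props = {**NEW_TABLE_DEFAULTS, **(overrides or {})}
    body = ", ".join(f"'{k}'='{v}'" for k, v in sorted(props.items()))
    return f"TBLPROPERTIES ({body})"


# Example: keep format version 2 but write gzip-compressed Parquet files.
print(tblproperties_clause({"write.parquet.compression-codec": "gzip"}))
```

Spelling the properties out in the DDL makes a table's behavior independent of which library version created it.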
Apache Iceberg 1.3.1
What's Changed
- Hive: Set commit state as Unknown before throwing CommitStateUnknownException by @nastra in #8029
- Spark 3.4: WAP branch not propagated when using DELETE without WHERE by @nastra in #8028
- Core: Include all reachable snapshots with v1 format and REF snapshot mode by @nastra in #8027
- Spark 3.3: Backport 'WAP branch not propagated when using DELETE without WHERE' by @nastra in #8036
- Flink: Remove the creation of default database in FlinkCatalog by @Fokko in #8039
- Core: Handle optional fields by @Fokko in #8064
- Core: Abort file groups should be under same lock as committerService by @ConeyLiu in #8060
- Spark 3.3: Fix rewrite_position_deletes for certain partition types by @szehon-ho in #8069
- Spark 3.4: Fix rewrite_position_deletes for certain partition types by @szehon-ho in #8059
Full Changelog: apache-iceberg-1.3.0...apache-iceberg-1.3.1
Apache Iceberg 1.3.0
What's Changed
- Nessie: Remove compile-time Hadoop dependency by @nastra in #7054
- Core: Fix deprecation message by @nastra in #7104
- Build: Update ORC to 1.8.3 by @williamhyun in #7124
- AWS: Use Apache HTTP client as default AWS HTTP client by @singhpk234 in #7119
- AWS: Enable virtual-host-style requests for MinioContainer by @nastra in #7125
- Flink: Bump to Flink 1.15.3 by @Fokko in #7059
- Flink: Bump to Flink 1.16.1 by @Fokko in #7057
- Core: Use unknown report type for forward-compatibility by @nastra in #7145
- Aliyun: Remove AssertHelpers by @liuxiaocs7 in #7116
- dell: remove usage of AssertHelpers by @liuxiaocs7 in #7143
- Core: Minor refactoring of PartitionsTable by @ajantha-bhat in #6975
- Build: Let RevAPI compare against 1.2.0 by @nastra in #7155
- MR: Remove deprecate AssertHelpers by @liuxiaocs7 in #7159
- Core: Remove deprecated validation APIs in MergingSnapshotProducer by @amogh-jahagirdar in #7150
- data: Remove AssertHelpers Usage by @liuxiaocs7 in #7134
- Flink:fix flink streaming query problem [ Cannot get a client from a closed pool] by @xuzhiwen1255 in #6614
- Spark 3.3: Remove use of deprecated SparkFilesScan by @szehon-ho in #7106
- Docs: Add `rest` to the catalog configuration by @Fokko in #7126
- Contributing Docs: Add section for testing code by @nastra in #7131
- Core, API: View Version implementation by @amogh-jahagirdar in #6861
- Update defaults of max-concurrent-file-group-rewrites to 5 by @karuppayya in #6907
- Flink: fixed Cloneable not implemented on CatalogLoader by @xuzhiwen1255 in #7168
- Core: Refactor actions results by @ajantha-bhat in #7089
- Docs: update doc to read easier by @joonsun-baek in #7167
- API: Fix retainAll and removeAll in CharSequenceSet by @zhongyujiang in #7133
- Spark 3.3: Support metadata column in the changelog table by @flyrain in #7152
- Spark 3.2: Support metadata column in the changelog table by @flyrain in #7178
- Flink: Backport #6614 to Flink 1.15 by @xuzhiwen1255 in #7165
- Core: Remove deprecated code from 1.2.0 by @nastra in #7156
- S3 Credentials provider support in DefaultAwsClientFactory #7063 by @dpaani in #7066
- Core: Move InMemoryCatalog from test to core by @nastra in #7185
- Doc: Retypeset the Flink document by @hililiwei in #7099
- Core: Store split offset for delete files by @singhpk234 in #7011
- Flink: Backport #6614 to Flink 1.14 by @xuzhiwen1255 in #7166
- Core, Hive: Support pluggable ClientPool by @lirui-apache in #6698
- AWS: Remove deprecated AssertHelpers by @liuxiaocs7 in #7195
- Spark: Support loading function as FunctionCatalog in SparkSessionCatalog by @bowenliang123 in #7153
- Flink: Implement data statistics operator to collect traffic distribution for guiding smart shuffling by @yegangy0718 in #6382
- Build: Move RevApi breakage to correct version by @nastra in #7223
- Ability to add multiple metrics reporters to scan by @karuppayya in #6919
- Spark 3.3: Use ProcedureInput in AncestorsOfProcedure by @aokolnychyi in #7177
- Core: Parse snapshot-id as long in remove-statistics update by @nastra in #7235
- Bump Nessie to 0.54.0 by @snazy in #7146
- Optimized spark vectorized read parquet decimal by @ConeyLiu in #3249
- Core: Optimize S3 layout of Datafiles by expanding first character set of the hash by @singhpk234 in #7128
- AWS: Prevent token refresh scheduling on every sign request by @nastra in #7270
- Disable local credentials if remote signing is enabled by @danielcweeks in #7230
- Spark: Revert "Spark: Add "Iceberg" prefix to SparkTable name string for SparkUI (#5629) by @amogh-jahagirdar in #7273
- Spark: broadcast table instead of file IO in rewrite manifests by @bryanck in #7263
- AWS: abort S3 input stream on close if not EOS by @bryanck in #7262
- Spark 3.2: Use ProcedureInput in AncestorsOfProcedure and AddFilesProcedure by @aokolnychyi in #7260
- Spark 3.3: Dataset writes for position deletes by @szehon-ho in #7029
- REST: fix previous locations for refs-only load by @bryanck in #7284
- Core: Fix flakiness in HadoopFileIOTest by @nastra in #7253
- Flink: Data statistics operator sends local data statistics to coordinator and receive aggregated data statistics from coordinator for smart shuffling by @yegangy0718 in #7269
- AWS: Make AuthSession cache static by @nastra in #7289
- Core: Require namespace when creating table using InMemoryCatalog by @nastra in #7252
- Refactor PartitionsTable planning by @dramaticlly in #7190
- Flink: Introduce Flink 1.17 by @hililiwei in #7254
- AWS: Check commit status after failed commit if AWS client performed retries by @ChristinaTech in #7198
- Core: Fix errorprone warning by @ajantha-bhat in #7286
- Bump Nessie to 0.56.0 by @snazy in #7283
- Build: Bump actions/stale from 7.0.0 to 8.0.0 by @dependabot in #7200
- Build: Bump org.apache.hadoop:hadoop-client from 3.3.4 to 3.3.5 by @dependabot in #7201
- Spark: apply rewrite manifest action fix to 3.1,3.2 by @bryanck in #7296
- Build: Spark version of `iceberg-delta-lake` to 3.3.2 by @doki23 in #7199
- Nessie: Use latest hash for catalog APIs by @ajantha-bhat in #6789
- Support vectorized reading int96 timestamps in imported data by @yabola in #6962
- Flink: Expose write-parallelism in SQL Hints by @hililiwei in #7039
- Nessie: Fix testcase failures by @ajantha-bhat in #7320
- Flink: move the classes from flink.sink.shuffle.statistics pkg to one level up as flink.sink.shuffle pkg by @stevenzwu in #7322
- Spark 3.3: Add doc for the changelog view procedure. by @flyrain in #7147
- Bump Nessie from 0.56.0 to 0.57.0 by @snazy in #7323
- Flink 1.15 1.17: Port Expose write-parallelism in SQL Hints to 1.15 & 1.17 by @hililiwei in #7327
- Update issue template for 1.2.1 release by @danielcweeks in #7331
- Core: Fix SnapshotProducer#targetBranch's exception message by @zhongyujiang in #7315
- Bump Gradle from 8.0.2 to 8.1 by @snazy in #7333
- Build: Fix flaky checkstyle issue by @ajantha-bhat in #7321
- [Infra] Update vote mail sample in source-release.sh by @gaborkaszab in #7330
- Core: Add missing metrics reporters when creating BaseTable by @nastra in #7341
- Core, Spark 3.3: Add FileRewriter API by @aokolnychyi in #7175
- Spark - Accept an `output-spec-id` that allows writing to a desired partition spec by @gustavoatt in #7120
- [ORC][Spark] - Support selected vector with ORC reader on the row and batch reader by @pavibhai in #7197
- Flink: use correct scan mode when in TABLE_SCAN_THEN_INCREMENTAL mode by @chenjunjiedada in #7338
- Throw NoSuchIcebergTableException instead of ValidationException in G… by @ericlgoodman in #7277
- Build: Bump Airlift from 0.21 to 0.24 by @Fokko in #7347
- Docs: clarify Hive on Tez con...
Apache Iceberg 1.2.1
Full Changelog: apache-iceberg-1.2.0...apache-iceberg-1.2.1