
Equality delete lost after compact data files #10312

Open
CodingJun opened this issue May 11, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@CodingJun

CodingJun commented May 11, 2024

Apache Iceberg version

1.5.1

Query engine

Spark

Please describe the bug 🐞

I have a program that continuously writes streaming data to Iceberg, and I regularly use Spark to compact the data files. But I found that after compacting the data files, some of the data was not deleted correctly. The following example reproduces the issue:

Original table:

id value
1 a
2 b
3 c

Writing process:

  • t1: Thread 1 starts compacting data files with RewriteDataFilesSparkAction. (start snapshot-id: 1, start sequence-number: 1)
  • t2: Thread 2 writes an equality delete, id = 2. (snapshot-id: 2, sequence-number: 2)
  • t3: Thread 2 appends new data, [4, d]. (snapshot-id: 3, sequence-number: 3)
  • t4: Thread 1's compaction of the data files completes. (snapshot-id: 4, sequence-number: 4)

Result:

id value
1 a
2 b
3 c
4 d

The correct result should be:

id value
1 a
3 c
4 d
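
For context, the compaction in Thread 1 is triggered roughly like the sketch below (not the exact code; spark and table are assumed to be set up already, and the streaming writer that produces the equality deletes is not shown):

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;

// Thread 1: compact the data files. With the default
// use-starting-sequence-number = true, the rewritten files keep the
// sequence number of the snapshot the rewrite started from (1 here).
RewriteDataFiles.Result result =
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        .option(RewriteDataFiles.USE_STARTING_SEQUENCE_NUMBER, "true")
        .execute();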

PS:

When I set use-starting-sequence-number = false for rewriteDataFiles, Thread 1's compaction of the data files failed at t4. Stacktrace:

Caused by: org.apache.iceberg.exceptions.ValidationException: Cannot commit, found new delete for replaced data file: GenericDataFile{content=data, file_path=/var/folders/5z/dqrlv_ts0wqf36vd39bb384h0000gn/T/junit17491575750166086656/9f77fae8-d62a-426d-971f-a342b6775c44/test_schema/test_table/data/00000-2-52ae94aa-b796-4c42-bf9c-92d36c52e522-00001.parquet, file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=1, file_size_in_bytes=407, column_sizes=null, value_counts=org.apache.iceberg.util.SerializableMap@0, null_value_counts=org.apache.iceberg.util.SerializableMap@1, nan_value_counts=org.apache.iceberg.util.SerializableMap@0, lower_bounds=org.apache.iceberg.SerializableByteBufferMap@e1782, upper_bounds=org.apache.iceberg.SerializableByteBufferMap@e1782, key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=null}
	at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:50)
	at org.apache.iceberg.MergingSnapshotProducer.validateNoNewDeletesForDataFiles(MergingSnapshotProducer.java:418)
	at org.apache.iceberg.MergingSnapshotProducer.validateNoNewDeletesForDataFiles(MergingSnapshotProducer.java:367)
	at org.apache.iceberg.BaseRewriteFiles.validate(BaseRewriteFiles.java:108)
	at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:175)
	at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:296)
	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:214)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:198)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:190)
	at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:295)
	at org.apache.iceberg.actions.RewriteDataFilesCommitManager.commitFileGroups(RewriteDataFilesCommitManager.java:89)
	at org.apache.iceberg.actions.RewriteDataFilesCommitManager.commitOrClean(RewriteDataFilesCommitManager.java:110)
	at org.apache.iceberg.spark.actions.RewriteDataFilesSparkAction.doExecute(RewriteDataFilesSparkAction.java:291)
	... 8 more

Question:

Why are the equality delete files lost? Is this the correct behavior, or is it a bug?

@CodingJun CodingJun added the bug Something isn't working label May 11, 2024
@lurnagao-dahua
Contributor

Is there any error log for the equality delete?

@CodingJun
Author

Is there any error log for the equality delete?

No error. If I read directly from snapshot id 3, the result is correct.
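
(For reference, a read pinned to that snapshot can be expressed roughly as below; db.test_table is a placeholder for the real table name, and 3 stands for the snapshot id labelled 3 in the example, which in practice is a generated long value.)

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Time-travel read of the table state as of the snapshot written at t3.
Dataset<Row> atT3 =
    spark.read()
        .format("iceberg")
        .option("snapshot-id", 3L)
        .load("db.test_table");
atT3.show(); // shows 1/a, 3/c, 4/d, i.e. the equality delete is still applied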

@CodingJun
Author

I found the code that drops the equality delete files, here:

public List<ManifestFile> apply(TableMetadata base, Snapshot snapshot) {
  // filter any existing manifests
  List<ManifestFile> filtered =
      filterManager.filterManifests(
          SnapshotUtil.schemaFor(base, targetBranch()),
          snapshot != null ? snapshot.dataManifests(ops.io()) : null);

  long minDataSequenceNumber =
      filtered.stream()
          .map(ManifestFile::minSequenceNumber)
          // filter out unassigned in rewritten manifests
          .filter(seq -> seq != ManifestWriter.UNASSIGNED_SEQ)
          .reduce(base.lastSequenceNumber(), Math::min);
  deleteFilterManager.dropDeleteFilesOlderThan(minDataSequenceNumber);

  List<ManifestFile> filteredDeletes =
      deleteFilterManager.filterManifests(
          SnapshotUtil.schemaFor(base, targetBranch()),
          snapshot != null ? snapshot.deleteManifests(ops.io()) : null);

  • TableMetadata base is refreshed, so base.lastSequenceNumber() is 3.
  • The only filtered manifest whose minSequenceNumber is not UNASSIGNED_SEQ is the one for the file newly appended at t3, with sequence number 3.
  • So minDataSequenceNumber is also 3, and the manifest for the equality delete (sequence number 2) is dropped.

I think base.lastSequenceNumber() here should effectively be 1 (the sequence number of the start snapshot) instead of 3.

@CodingJun CodingJun changed the title Equality delete files lost after compact data files Equality delete lost after compact data files May 11, 2024
@pvary
Contributor

pvary commented May 11, 2024

@CodingJun: Your analysis seems correct to me. We need to take the minimum of minDataSequenceNumber and the starting sequence number.

@RussellSpitzer and @aokolnychyi might know more.
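
A rough illustration of that idea against the snippet quoted above (a hypothetical sketch only, not an actual patch; startingSequenceNumber is a made-up variable standing for the sequence number of the rewrite's starting snapshot, which would still need to be plumbed through to this point):

// Cap the delete-file cutoff at the rewrite's starting sequence number, so
// equality deletes committed after the rewrite started (sequence number 2
// in the example) are not dropped.
long startingSequenceNumber = 1L; // sequence number of the rewrite's start snapshot
long minDataSequenceNumber =
    filtered.stream()
        .map(ManifestFile::minSequenceNumber)
        .filter(seq -> seq != ManifestWriter.UNASSIGNED_SEQ)
        .reduce(Math.min(base.lastSequenceNumber(), startingSequenceNumber), Math::min);
deleteFilterManager.dropDeleteFilesOlderThan(minDataSequenceNumber);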

@lurnagao-dahua
Contributor

lurnagao-dahua commented May 11, 2024

Is your process running with use-starting-sequence-number = true?
I tested with use-starting-sequence-number = true and the compaction failed (Apache Iceberg 1.4.3):
Exception in thread "main" org.apache.iceberg.exceptions.ValidationException: Cannot commit, found new delete for replaced data file: GenericDataFile ...

@CodingJun
Author

Is your process running with use-starting-sequence-number = true? I tested with use-starting-sequence-number = true and the compaction failed (Apache Iceberg 1.4.3): Exception in thread "main" org.apache.iceberg.exceptions.ValidationException: Cannot commit, found new delete for replaced data file: GenericDataFile ...

Yes, the default setting is true. Can you debug it to check whether the configuration is actually taking effect?
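
(One quick sanity check, assuming the constants exposed by the RewriteDataFiles interface in recent Iceberg versions, is to print the option key and its default:)

import org.apache.iceberg.actions.RewriteDataFiles;

// Option key and its default as defined by the RewriteDataFiles interface;
// the default is expected to be true, so the starting sequence number should
// be used unless the option is overridden somewhere.
System.out.println(RewriteDataFiles.USE_STARTING_SEQUENCE_NUMBER);         // "use-starting-sequence-number"
System.out.println(RewriteDataFiles.USE_STARTING_SEQUENCE_NUMBER_DEFAULT); // true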

@CodingJun
Author

Do you know if this is a bug? @RussellSpitzer @aokolnychyi
