
[SUPPORT] Error using the property hoodie.datasource.write.drop.partition.columns #11144

Open · ghost opened this issue May 3, 2024 · 3 comments
Labels: feature-enquiry (issue contains feature enquiries/requests or great improvement ideas)

ghost commented May 3, 2024

Hi.
I am developing a process to ingest data from HDFS using Hudi. I want to partition the data with a custom key generator class, where the partition key is a tuple columnName@NumPartitions. Inside the custom key generator I then use the modulo function to route each row to one partition or another.
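For context, a minimal sketch of what such a key generator might look like (a hypothetical illustration, not the actual class from this issue; it assumes the columnName@NumPartitions convention and extends Hudi's KeyGenerator base class):

import org.apache.avro.generic.GenericRecord
import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.common.model.HoodieKey
import org.apache.hudi.keygen.KeyGenerator

// Hypothetical sketch: split the configured partition path field into
// (column, numBuckets) and route each row to a bucket with modulo.
class CustomKeyGenerator(props: TypedProperties) extends KeyGenerator(props) {
  private val Array(partitionColumn, numBucketsStr) =
    props.getString("hoodie.datasource.write.partitionpath.field").split("@")
  private val numBuckets = numBucketsStr.toInt
  private val recordKeyField =
    props.getString("hoodie.datasource.write.recordkey.field")

  override def getKey(record: GenericRecord): HoodieKey = {
    val recordKey = String.valueOf(record.get(recordKeyField))
    val partitionValue = String.valueOf(record.get(partitionColumn))
    // floorMod keeps the bucket non-negative even for negative hash codes
    val bucket = math.floorMod(partitionValue.hashCode, numBuckets)
    new HoodieKey(recordKey, bucket.toString)
  }
}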

The initial load is the following:

spark.read.option("mergeSchema","true").parquet("PATH").
withColumn("_hoodie_is_deleted", lit(false)).
write.format("hudi").
option(OPERATION_OPT_KEY, "upsert").
option(CDC_ENABLED.key(), "true").
option(TABLE_NAME, tableName).
option("hoodie.datasource.write.payload.class","CustomOverwriteWithLatestAvroPayload").
option("hoodie.avro.schema.validate","false").
option("hoodie.datasource.write.recordkey.field","CID").
option("hoodie.datasource.write.precombine.field","sequential_total").
option("hoodie.datasource.write.new.columns.nullable", "true").
option("hoodie.datasource.write.reconcile.schema","true").
option("hoodie.metadata.enable","false").
option("hoodie.index.type","SIMPLE").
option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
option("hoodie.datasource.write.keygenerator.class","CustomKeyGenerator").
option("hoodie.datasource.write.partitionpath.field","CID@12").
option("hoodie.datasource.write.drop.partition.columns","true").
mode(Overwrite).
save("/tmp/hudi2")

I have added the property hoodie.datasource.write.drop.partition.columns because when I read the final path, Hudi throws the error: Cannot find columns: 'CID@12' in the schema.
But with this property it does not work either. The error that appears is the following:

org.apache.hudi.internal.schema.HoodieSchemaException: Failed to fetch schema from the table
at org.apache.hudi.HoodieBaseRelation.$anonfun$x$2$10(HoodieBaseRelation.scala:179)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.hudi.HoodieBaseRelation.x$2$lzycompute(HoodieBaseRelation.scala:175)
at org.apache.hudi.HoodieBaseRelation.x$2(HoodieBaseRelation.scala:151)
at org.apache.hudi.HoodieBaseRelation.internalSchemaOpt$lzycompute(HoodieBaseRelation.scala:151)
at org.apache.hudi.HoodieBaseRelation.internalSchemaOpt(HoodieBaseRelation.scala:151)
at org.apache.hudi.BaseFileOnlyRelation.<init>(BaseFileOnlyRelation.scala:69)
at org.apache.hudi.DefaultSource$.resolveBaseFileOnlyRelation(DefaultSource.scala:321)
at org.apache.hudi.DefaultSource$.createRelation(DefaultSource.scala:262)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:118)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:74)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
... 63 elided

danny0405 (Contributor) commented

hoodie.datasource.write.drop.partition.columns is false by default; when it is set to true, the data files do not include the partition columns. Either way, the partition field you declared here should be a field name, not a value like CID@12.
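In other words, a hedged sketch of the suggested change (custom.keygen.num.buckets is a made-up property name that the custom key generator would have to read itself, not a Hudi config; df is the DataFrame being written):

import org.apache.spark.sql.SaveMode.Overwrite

df.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "CID").
  option("hoodie.datasource.write.keygenerator.class", "CustomKeyGenerator").
  option("hoodie.datasource.write.partitionpath.field", "CID"). // a field name, not "CID@12"
  option("custom.keygen.num.buckets", "12"). // hypothetical custom property
  mode(Overwrite).
  save("/tmp/hudi2")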

ghost (Author) commented May 6, 2024

And is there any way of partitioning the data using a hash function of the row's primary key, to improve performance for updating rows? I have developed a custom BuiltinKeyGenerator overriding the method getPartitionPath (I take the partition path, which is the primary key, and apply the operation % numBuckets), but the problem is that when I read the data back, the value of the primary key column is the result of that operation instead of the real value.
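A rough sketch of the approach described above (the class name is illustrative and the bucket count is hard-coded for brevity; it extends Hudi's SimpleKeyGenerator):

import org.apache.avro.generic.GenericRecord
import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.keygen.SimpleKeyGenerator

// Sketch: derive the partition path from a hash of the record key while
// leaving the record key value itself untouched.
class HashBucketKeyGenerator(props: TypedProperties) extends SimpleKeyGenerator(props) {
  private val numBuckets = 12 // illustrative; could be read from props instead

  override def getPartitionPath(record: GenericRecord): String = {
    val key = getRecordKey(record) // the real primary key value
    math.floorMod(key.hashCode, numBuckets).toString
  }
}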

danny0405 (Contributor) commented

The contract here is: the partition field should be in the table schema anyway.
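One way to satisfy that contract (a hedged sketch; the bucket column name is illustrative and df is the source DataFrame) is to materialize the bucket as its own column before writing, so the partition field exists in the table schema and the real CID values survive a round trip:

import org.apache.spark.sql.SaveMode.Overwrite
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// Compute the bucket in Spark itself and partition on that column;
// CID stays a normal data column with its original values.
val bucketed = df.withColumn("bucket", pmod(hash(col("CID")), lit(12)))

bucketed.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "CID").
  option("hoodie.datasource.write.partitionpath.field", "bucket").
  mode(Overwrite).
  save("/tmp/hudi2")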

@codope added the feature-enquiry label on May 9, 2024