
[SUPPORT] Error using the property hoodie.datasource.write.drop.partition.columns #11144

Open · ghost opened this issue May 3, 2024 · 3 comments
Labels: feature-enquiry (issue contains feature enquiries/requests or great improvement ideas)

ghost commented May 3, 2024

Hi.
I am developing a process to ingest data from HDFS using Hudi. I want to partition the data with a custom key generator class, where the partition key is a tuple columnName@NumPartitions. Inside the custom key generator I then use the modulo function to route each row to one partition or another.
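For context, a minimal sketch of what such a key generator might look like (a hypothetical illustration, not the actual class from this issue; it assumes the columnName@NumPartitions convention and extends Hudi's KeyGenerator base class):

import org.apache.avro.generic.GenericRecord
import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.common.model.HoodieKey
import org.apache.hudi.keygen.KeyGenerator

// Hypothetical sketch: split the configured partition path field into
// (column, numBuckets) and route each row to a bucket with modulo.
class CustomKeyGenerator(props: TypedProperties) extends KeyGenerator(props) {
  private val Array(partitionColumn, numBucketsStr) =
    props.getString("hoodie.datasource.write.partitionpath.field").split("@")
  private val numBuckets = numBucketsStr.toInt
  private val recordKeyField =
    props.getString("hoodie.datasource.write.recordkey.field")

  override def getKey(record: GenericRecord): HoodieKey = {
    val recordKey = String.valueOf(record.get(recordKeyField))
    val partitionValue = String.valueOf(record.get(partitionColumn))
    // floorMod keeps the bucket non-negative even for negative hash codes
    val bucket = math.floorMod(partitionValue.hashCode, numBuckets)
    new HoodieKey(recordKey, bucket.toString)
  }
}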

The initial load is the following:

spark.read.option("mergeSchema","true").parquet("PATH").
withColumn("_hoodie_is_deleted", lit(false)).
write.format("hudi").
option(OPERATION_OPT_KEY, "upsert").
option(CDC_ENABLED.key(), "true").
option(TABLE_NAME, tableName).
option("hoodie.datasource.write.payload.class","CustomOverwriteWithLatestAvroPayload").
option("hoodie.avro.schema.validate","false").
option("hoodie.datasource.write.recordkey.field","CID").
option("hoodie.datasource.write.precombine.field","sequential_total").
option("hoodie.datasource.write.new.columns.nullable", "true").
option("hoodie.datasource.write.reconcile.schema","true").
option("hoodie.metadata.enable","false").
option("hoodie.index.type","SIMPLE").
option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
option("hoodie.datasource.write.keygenerator.class","CustomKeyGenerator").
option("hoodie.datasource.write.partitionpath.field","CID@12").
option("hoodie.datasource.write.drop.partition.columns","true").
mode(Overwrite).
save("/tmp/hudi2")

I have added the property hoodie.datasource.write.drop.partition.columns because when I read the final path, Hudi throws the error: Cannot find columns: 'CID@12' in the schema.
But with this property it does not work either. The error that appears is the following:

org.apache.hudi.internal.schema.HoodieSchemaException: Failed to fetch schema from the table
at org.apache.hudi.HoodieBaseRelation.$anonfun$x$2$10(HoodieBaseRelation.scala:179)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.hudi.HoodieBaseRelation.x$2$lzycompute(HoodieBaseRelation.scala:175)
at org.apache.hudi.HoodieBaseRelation.x$2(HoodieBaseRelation.scala:151)
at org.apache.hudi.HoodieBaseRelation.internalSchemaOpt$lzycompute(HoodieBaseRelation.scala:151)
at org.apache.hudi.HoodieBaseRelation.internalSchemaOpt(HoodieBaseRelation.scala:151)
at org.apache.hudi.BaseFileOnlyRelation.<init>(BaseFileOnlyRelation.scala:69)
at org.apache.hudi.DefaultSource$.resolveBaseFileOnlyRelation(DefaultSource.scala:321)
at org.apache.hudi.DefaultSource$.createRelation(DefaultSource.scala:262)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:118)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:74)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
... 63 elided

danny0405 (Contributor) commented

hoodie.datasource.write.drop.partition.columns is false by default; when it is set to true, the data files do not include the partition columns. Either way, the partition field you declared here should be a field name, not a value like CID@12.
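In other words, a hedged sketch of the suggested change (custom.keygen.num.buckets is a made-up property name that the custom key generator would have to read itself, not a Hudi config; df is the DataFrame being written):

import org.apache.spark.sql.SaveMode.Overwrite

df.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "CID").
  option("hoodie.datasource.write.keygenerator.class", "CustomKeyGenerator").
  option("hoodie.datasource.write.partitionpath.field", "CID"). // a field name, not "CID@12"
  option("custom.keygen.num.buckets", "12"). // hypothetical custom property
  mode(Overwrite).
  save("/tmp/hudi2")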

ghost (Author) commented May 6, 2024

And is there any way of partitioning the data using a hash function of the row's primary key, to improve performance for updating rows? I have developed a custom BuiltinKeyGenerator overriding the method getPartitionPath (I take the partition path, which is the primary key, and apply the operation % numBuckets), but the problem is that when I read the data back, the value of the primary key column is the result of that operation instead of the real value.
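A rough sketch of the approach described above (the class name is illustrative and the bucket count is hard-coded for brevity; it extends Hudi's SimpleKeyGenerator):

import org.apache.avro.generic.GenericRecord
import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.keygen.SimpleKeyGenerator

// Sketch: derive the partition path from a hash of the record key while
// leaving the record key value itself untouched.
class HashBucketKeyGenerator(props: TypedProperties) extends SimpleKeyGenerator(props) {
  private val numBuckets = 12 // illustrative; could be read from props instead

  override def getPartitionPath(record: GenericRecord): String = {
    val key = getRecordKey(record) // the real primary key value
    math.floorMod(key.hashCode, numBuckets).toString
  }
}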

danny0405 (Contributor) commented

The contract here is: the partition field should be in the table schema anyway.
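One way to satisfy that contract (a hedged sketch; the bucket column name is illustrative and df is the source DataFrame) is to materialize the bucket as its own column before writing, so the partition field exists in the table schema and the real CID values survive a round trip:

import org.apache.spark.sql.SaveMode.Overwrite
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// Compute the bucket in Spark itself and partition on that column;
// CID stays a normal data column with its original values.
val bucketed = df.withColumn("bucket", pmod(hash(col("CID")), lit(12)))

bucketed.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "CID").
  option("hoodie.datasource.write.partitionpath.field", "bucket").
  mode(Overwrite).
  save("/tmp/hudi2")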

@codope added the feature-enquiry label on May 9, 2024