[SUPPORT] xxx.parquet is not a Parquet file #11178
Comments
@MrAladdin There is a related fix here: https://github.com/apache/hudi/pull/10883/files
The 0.14.1 version does not have hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/LSMTimelineWriter.java.
@MrAladdin Yes, you are correct. This may apply to LSMTimelineWriter as well. @danny0405 Any idea here?
I'm wondering how the table got written. Was it written by a Flink streaming pipeline?
Spark Structured Streaming. When upserting an MOR table using the record_index index type, this exception suddenly occurred, leaving downstream jobs unable to read the data. Other tables built in the same way have not hit this exception yet. Asynchronous compaction is enabled within the Spark Structured Streaming job.
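For context, a MOR + record-index streaming writer like the one described would typically carry options along these lines (a hedged sketch of Hudi 0.14 config keys; the values are illustrative assumptions, not the reporter's actual settings):

```properties
# Hypothetical writer config matching the description above
hoodie.datasource.write.table.type=MERGE_ON_READ
hoodie.datasource.write.operation=upsert
hoodie.datasource.write.recordkey.field=records_key
# record-level index in the metadata table
hoodie.index.type=RECORD_INDEX
hoodie.metadata.record.index.enable=true
# asynchronous compaction for the MOR streaming write
hoodie.compact.inline=false
hoodie.datasource.compaction.async.enable=true
```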
@MrAladdin Can you please share the timeline and writer configurations?
1. In fact, there is only one writing program, and all table services run inside that Structured Streaming program. I just discovered that in .option(RECORDKEY_FIELD.key(), "records_key"), records_key is unique within each partition, but a very small number of records share the same records_key across different partitions. Since record_index is a global index, could this be the cause of the exception during upsert? Thanks.
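A minimal sketch (plain Python, not Hudi code) of why a global record-level index treats "same key, different partition" as an update rather than an insert. With a partition-scoped index the two rows below would be independent; with a global index the second write is routed to the first row's existing location:

```python
def upsert_global(index, records):
    """index maps record_key -> (partition, last_op); global across partitions."""
    for key, partition in records:
        if key in index:
            # Global index: the existing location wins, even when the incoming
            # record targets a different partition.
            index[key] = (index[key][0], "update")
        else:
            index[key] = (partition, "insert")
    return index

idx = {}
upsert_global(idx, [("k1", "202301")])  # first write: insert into 202301
upsert_global(idx, [("k1", "202302")])  # same key, new partition: update in 202301
print(idx["k1"])  # -> ('202301', 'update'); the row stays in the old partition
```

This is only a model of the routing behavior; whether that routing relates to the zero-length file in this report is exactly the open question above.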
@ad1happy2go Could you help with the question in my reply above? Thank you.
1. We occasionally hit this problem on version 0.12; the workaround was to delete the damaged files with hadoop fs -rm -r. Now, after upgrading, this is the first time the issue has appeared on version 0.14.
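Since the error reports "length is too low: 0", one way to locate such damaged files before deleting them is to scan for zero-byte .parquet files. The sketch below walks a local filesystem; for HDFS/ViewFS the equivalent would use the Hadoop shell or a FileSystem client, and the paths are hypothetical:

```python
import os

def find_zero_length_parquet(root):
    """Walk a directory tree and return sorted paths of empty .parquet files."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".parquet"):
                path = os.path.join(dirpath, name)
                if os.path.getsize(path) == 0:
                    bad.append(path)
    return sorted(bad)
```

Deleting such files by hand only clears the symptom; if the timeline still references them, readers may keep failing, so sharing the timeline (as asked above) is still the right next step.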
@xushiyan Could you help with the question in my reply above? Thank you. 2. A question: when writing with Spark Structured Streaming, the number of HFiles under .hoodie/metadata/record_index is twice the value set by .option("hoodie.metadata.record.index.min.filegroup.count", "720"); but when doing offline batch writes with a Spark DataFrame, each commit generates a corresponding number of HFiles, leading to an excessively large number of HFiles under record_index. What is the reason for this, how can we better control the number of HFiles under .hoodie/metadata/record_index, what is the most reasonable size for each HFile, and which parameters are involved?
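On the HFile-count question: to the best of my knowledge, the record index file-group count and sizing in Hudi 0.14 are governed mainly by the metadata configs below (values shown are illustrative, not recommendations). Notably, the growth factor defaults to 2.0, which may explain observing roughly twice the configured minimum; a maintainer should confirm:

```properties
# Lower and upper bounds on record-index file groups in the metadata table
hoodie.metadata.record.index.min.filegroup.count=720
hoodie.metadata.record.index.max.filegroup.count=10000
# Target size per record-index file group, in bytes
hoodie.metadata.record.index.max.filegroup.size=1073741824
# Headroom multiplier applied when estimating the file-group count
hoodie.metadata.record.index.growth.factor=2.0
```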
1. Not sure about the root cause or any scenario that could cause this.
Describe the problem you faced
Environment Description
Hudi version: 0.14.1
Spark version: 3.4
Hive version: 3.1.2
Hadoop version: 3.1
Storage (HDFS/S3/GCS..): HDFS
Running on Docker? (yes/no): no
Stacktrace
Caused by: java.lang.RuntimeException: viewfs://nbns/user/quantum_social/lakehouse/social/dwd_social_kbi_beauty_lower_v1/partition_index_date=202302/229164d5-911f-49df-91b5-cb15aecc60de-0_2531-32510-4568658_20240508183714815.parquet is not a Parquet file (length is too low: 0)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:540)
at org.apache.parquet.hadoop.ParquetFileReader.&lt;init&gt;(ParquetFileReader.java:777)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:658)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:53)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:39)
at org.apache.spark.sql.execution.datasources.parquet.Spark34LegacyHoodieParquetFileFormat.footerFileMetaData$lzycompute$1(Spark34LegacyHoodieParquetFileFormat.scala:184)
at org.apache.spark.sql.execution.datasources.parquet.Spark34LegacyHoodieParquetFileFormat.footerFileMetaData$1(Spark34LegacyHoodieParquetFileFormat.scala:183)
at org.apache.spark.sql.execution.datasources.parquet.Spark34LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark34LegacyHoodieParquetFileFormat.scala:187)
at org.apache.hudi.HoodieDataSourceHelper$.$anonfun$buildHoodieParquetReader$1(HoodieDataSourceHelper.scala:67)
at org.apache.hudi.HoodieBaseRelation.$anonfun$createBaseFileReader$2(HoodieBaseRelation.scala:582)
at org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:673)
at org.apache.hudi.RecordMergingFileIterator.&lt;init&gt;(Iterators.scala:249)
at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:109)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)