On deleting a column from a Hudi table, it is still present when querying from Presto using the Hive connector #22704

sutodi opened this issue May 9, 2024 · 1 comment
sutodi commented May 9, 2024

On deleting a column from a Hudi table, it is still present when querying from Presto using the Hive connector. All rows have a null value for the deleted column, but the column still appears in the output of SELECT *. Running DESCRIBE on the table also shows that the column is still present.

Your Environment

  • Hudi bundle: org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.1 (used to create a Hudi table synced to the metastore)
  • Spark: 3.3.2

  • Presto version used: 0.281.1
  • Storage (HDFS/S3/GCS..): GCS
  • Data source and connector used: hive connector
  • Deployment (Cloud or On-prem): GCP dataproc
  • Pastebin link to the complete debug logs:

Expected Behavior

On deleting a Hudi column, I expected that the column would no longer be present when querying from Presto.

Current Behavior

All rows have a null value for the deleted column, but it is still present in the output when querying SELECT *. Also, on running DESCRIBE on the table, I see that the column is still present.

Possible Solution

TBD

Steps to Reproduce

  1. Create a Dataproc cluster and make the following changes:

    spark-defaults.conf:
        spark.serializer=org.apache.spark.serializer.KryoSerializer
        spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
        spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED
        spark.sql.legacy.avro.datetimeRebaseModeInWrite=CORRECTED
        spark.sql.legacy.avro.datetimeRebaseModeInRead=CORRECTED
        spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
        spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar

    hive-site.xml:
        hive.metastore.disallow.incompatible.col.type.changes=false
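
For reference, a minimal sketch of how this property would look inside hive-site.xml (standard Hadoop XML configuration syntax; the rest of the file is assumed to already exist):

    <!-- hive-site.xml: allow metastore schema changes that Hive considers incompatible -->
    <property>
      <name>hive.metastore.disallow.incompatible.col.type.changes</name>
      <value>false</value>
    </property>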

  2. Create a table in Hudi from pyspark (the spark session object is the one provided by the pyspark shell):

        from datetime import datetime
        from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

        schema = StructType([
            StructField("id", IntegerType(), True),
            StructField("name", StringType(), True),
            StructField("surname", StringType(), True),
            StructField("ts", TimestampType(), True)  # timestamp field
        ])
        data = [
            (1, "John", "Doe", datetime.now()),
            (2, "Jane", "Smith", datetime.now()),
            (3, "Michael", "Johnson", datetime.now()),
            (4, "Emily", "Williams", datetime.now())
        ]

        df = spark.createDataFrame(data, schema)
        df.write \
            .format("org.apache.hudi") \
            .option("hoodie.table.name", "hoodie_table") \
            .option("hoodie.datasource.write.recordkey.field", "id") \
            .option("hoodie.datasource.write.keyprefix", "ts") \
            .option("hoodie.schema.on.read.enable", "true") \
            .mode("overwrite") \
            .save("gs://xxxx/subham_test_metastore_13")

        spark.sql("CREATE TABLE default.subham_test_metastore_13 USING hudi LOCATION 'gs://xxxx/subham_test_metastore_13'")

  3. Drop the column from a spark-sql session (pyspark uses DataSource V1, so the DDL is run from spark-sql):

        set hoodie.schema.on.read.enable=true;

        ALTER TABLE default.subham_test_metastore_13 DROP COLUMN surname;

On doing this, we also get the error message below, but the column is nevertheless dropped from Hudi:

        Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :
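
Despite the HiveException, the drop takes effect on the Hudi side. A quick way to confirm, run from the same spark-sql session (a sketch):

    -- spark-sql: surname should no longer be listed after the DROP COLUMN.
    DESCRIBE default.subham_test_metastore_13;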
  4. Update a row in the table from pyspark:

        schema = StructType([
            StructField("id", IntegerType(), True),
            StructField("name", StringType(), True),
            StructField("ts", TimestampType(), True)  # timestamp field
        ])
        data = [
            (1, "Johny", datetime.now())
        ]
        df = spark.createDataFrame(data, schema)
        df.write \
            .format("org.apache.hudi") \
            .option("hoodie.table.name", "hoodie_table") \
            .option("hoodie.datasource.write.recordkey.field", "id") \
            .option("hoodie.datasource.write.keyprefix", "ts") \
            .option("hoodie.schema.on.read.enable", "true") \
            .mode("append") \
            .save("gs://xxxx/subham_test_metastore_13")

  5. Now, from spark-sql and pyspark the dropped column no longer appears, but it still shows up when querying from Presto (see the sketch below).
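
A sketch of what this looks like from the Presto side, based on the behavior reported above (the catalog name "hive" is again an assumption):

    -- Presto CLI, Hive connector; "hive" is an assumed catalog name.
    DESCRIBE hive.default.subham_test_metastore_13;
    -- surname is still listed here, even though spark-sql no longer shows it.

    SELECT * FROM hive.default.subham_test_metastore_13;
    -- surname is returned as NULL for every row.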

Screenshots (if appropriate)

Context

We have a lakehouse in Hudi and use Presto with the Hive connector to query Hudi tables. We want to delete a column and are facing this problem.

sutodi added the bug label May 9, 2024

sutodi (Author) commented May 9, 2024

@codope
