On deleting a column from a Hudi table, it is still present when querying from Presto using the Hive connector #22704

sutodi opened this issue May 9, 2024 · 1 comment
sutodi commented May 9, 2024

On deleting a column from a Hudi table, it is still present when querying from Presto using the Hive connector. All rows have a null value for the deleted column, but the column still appears in the output of SELECT *. Running DESCRIBE on the table also shows that the column is still present.

Your Environment

  • Hudi bundle: org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.1 (used to create a Hudi table synced to the metastore)
  • Spark: 3.3.2

  • Presto version used: 0.281.1
  • Storage (HDFS/S3/GCS..): GCS
  • Data source and connector used: hive connector
  • Deployment (Cloud or On-prem): GCP dataproc
  • Pastebin link to the complete debug logs:

Expected Behavior

On deleting a Hudi column, I expected that the column would no longer be present when querying from Presto.

Current Behavior

All rows have a null value for the deleted column, but it is still present in the output when querying SELECT *. Also, on running DESCRIBE on the table, I see that the column is still present.

Possible Solution

TBD

Steps to Reproduce

  1. Create a Dataproc cluster and make the following changes:

    spark-defaults.conf:
        spark.serializer=org.apache.spark.serializer.KryoSerializer
        spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
        spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED
        spark.sql.legacy.avro.datetimeRebaseModeInWrite=CORRECTED
        spark.sql.legacy.avro.datetimeRebaseModeInRead=CORRECTED
        spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
        spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar

    hive-site.xml:
        hive.metastore.disallow.incompatible.col.type.changes=false
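
For reference, a minimal sketch of how this property would look inside hive-site.xml (standard Hadoop XML configuration syntax; the rest of the file is assumed to already exist):

    <!-- hive-site.xml: allow metastore schema changes that Hive considers incompatible -->
    <property>
      <name>hive.metastore.disallow.incompatible.col.type.changes</name>
      <value>false</value>
    </property>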

  2. Create a table in Hudi from pyspark (the spark session object is the one provided by the pyspark shell):

        from datetime import datetime
        from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

        schema = StructType([
            StructField("id", IntegerType(), True),
            StructField("name", StringType(), True),
            StructField("surname", StringType(), True),
            StructField("ts", TimestampType(), True)  # timestamp field
        ])
        data = [
            (1, "John", "Doe", datetime.now()),
            (2, "Jane", "Smith", datetime.now()),
            (3, "Michael", "Johnson", datetime.now()),
            (4, "Emily", "Williams", datetime.now())
        ]

        df = spark.createDataFrame(data, schema)
        df.write \
            .format("org.apache.hudi") \
            .option("hoodie.table.name", "hoodie_table") \
            .option("hoodie.datasource.write.recordkey.field", "id") \
            .option("hoodie.datasource.write.keyprefix", "ts") \
            .option("hoodie.schema.on.read.enable", "true") \
            .mode("overwrite") \
            .save("gs://xxxx/subham_test_metastore_13")

        spark.sql("CREATE TABLE default.subham_test_metastore_13 USING hudi LOCATION 'gs://xxxx/subham_test_metastore_13'")

  3. Drop the column from a spark-sql session (pyspark uses DataSource V1, so the DDL is run from spark-sql):

        set hoodie.schema.on.read.enable=true;

        ALTER TABLE default.subham_test_metastore_13 DROP COLUMN surname;

On doing this, we also get the error message below, but the column is nevertheless dropped from Hudi:

        Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :
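
Despite the HiveException, the drop takes effect on the Hudi side. A quick way to confirm, run from the same spark-sql session (a sketch):

    -- spark-sql: surname should no longer be listed after the DROP COLUMN.
    DESCRIBE default.subham_test_metastore_13;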
  4. Update a row in the table from pyspark:

        schema = StructType([
            StructField("id", IntegerType(), True),
            StructField("name", StringType(), True),
            StructField("ts", TimestampType(), True)  # timestamp field
        ])
        data = [
            (1, "Johny", datetime.now())
        ]
        df = spark.createDataFrame(data, schema)
        df.write \
            .format("org.apache.hudi") \
            .option("hoodie.table.name", "hoodie_table") \
            .option("hoodie.datasource.write.recordkey.field", "id") \
            .option("hoodie.datasource.write.keyprefix", "ts") \
            .option("hoodie.schema.on.read.enable", "true") \
            .mode("append") \
            .save("gs://xxxx/subham_test_metastore_13")

  5. Now, from spark-sql and pyspark the dropped column no longer appears, but it still shows up when querying from Presto (see the sketch below).
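
A sketch of what this looks like from the Presto side, based on the behavior reported above (the catalog name "hive" is again an assumption):

    -- Presto CLI, Hive connector; "hive" is an assumed catalog name.
    DESCRIBE hive.default.subham_test_metastore_13;
    -- surname is still listed here, even though spark-sql no longer shows it.

    SELECT * FROM hive.default.subham_test_metastore_13;
    -- surname is returned as NULL for every row.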

Screenshots (if appropriate)

Context

We have a lakehouse in Hudi and use Presto with the Hive connector to query Hudi tables. We want to delete a column and are facing this problem.

sutodi added the bug label May 9, 2024

sutodi (Author) commented May 9, 2024

@codope
