Reading large parquet file with parquet modular encryption fails #22703

Open
dieu-nguyen opened this issue May 9, 2024 · 2 comments
dieu-nguyen commented May 9, 2024

I use Presto to read Parquet files stored in HDFS. The files have Parquet modular encryption (PME) enabled.
Reading a small file works fine, but reading a large file fails in the decrypt function.
Presto shows the error: Query 20240509_030132_00001_r659k failed: GCM tag check failed
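
For context, "GCM tag check failed" is what AES-GCM decryption reports when the key, ciphertext, or additional authenticated data (AAD) seen by the reader differs from what the writer used. Below is a minimal, self-contained Java sketch (not Presto code) that reproduces the same failure mode by decrypting with a different AAD than was used for encryption:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class GcmTagMismatchDemo {
    public static void main(String[] args) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);                 // AES-128, same key size as the keys in this issue
        SecretKey key = keyGen.generateKey();

        byte[] iv = new byte[12];         // 96-bit nonce, the standard size for GCM
        new SecureRandom().nextBytes(iv);

        // Encrypt a "page" with AAD "module-A".
        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        enc.updateAAD("module-A".getBytes(StandardCharsets.UTF_8));
        byte[] ciphertext = enc.doFinal("page data".getBytes(StandardCharsets.UTF_8));

        // Decrypt with the correct key and IV, but a different AAD "module-B".
        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        dec.updateAAD("module-B".getBytes(StandardCharsets.UTF_8));
        dec.doFinal(ciphertext);          // throws javax.crypto.AEADBadTagException: Tag mismatch!
    }
}

So a tag-check failure does not necessarily mean the bytes are corrupt; it can equally mean the reader derived a different key or AAD for a module than the writer did.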

Your Environment

  • Presto version used: 0.283
  • Storage: HDFS
  • Data source and connector used: hive-hadoop2 connector, hive metastore, parquet file with PME, using InMemoryKMS
  • Deployment: On-prem
  • Link to the complete debug logs: presto_error_log

Expected Behavior

Data should be returned to the client.

Current Behavior

The query fails in the decrypt function with a GCM tag check error.

Possible Solution

TBD

Steps to Reproduce

  1. Prepare data: put two encrypted Parquet files into HDFS (a large one for the customers table, a small one for customers_light) and create Hive external tables over them. A sketch of a matching writer-side encryption configuration follows the DDL below.
  • (screenshot of the two files in HDFS)
  • Create table queries:
CREATE EXTERNAL TABLE `test_schema`.`customers`(
  `Index` string,
  `Customer Id` string,
  `First Name` string,
  `Last Name` string,
  `Company` string,
  `City` string,
  `Country` string,
  `Phone 1` string,
  `Phone 2` string,
  `Email` string,
  `Subscription Date` string,
  `Website` string,
  `dict_col_1` string,
  `dict_col_2` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://hdfshost/tmp/test/customers';


CREATE EXTERNAL TABLE `test_schema`.`customers_light`(
  `Index` string,
  `Customer Id` string,
  `First Name` string,
  `Last Name` string,
  `Company` string,
  `City` string,
  `Country` string,
  `Phone 1` string,
  `Phone 2` string,
  `Email` string,
  `Subscription Date` string,
  `Website` string,
  `dict_col_1` string,
  `dict_col_2` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://hdfshost/tmp/test/customers_light';
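
The issue does not show how the encrypted files were produced. For reference, with parquet-mr's PropertiesDrivenCryptoFactory the writer side is typically driven by Hadoop properties like the sketch below; the footer-key and column-key assignments here are illustrative assumptions, not taken from the issue:

import org.apache.hadoop.conf.Configuration;

public class WriterSideEncryptionConfig {
    public static Configuration encryptionConf() {
        Configuration conf = new Configuration();
        // Same crypto factory and KMS settings as the reader's core-site.xml below.
        conf.set("parquet.crypto.factory.class",
                 "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");
        conf.set("parquet.encryption.kms.client.class",
                 "org.apache.parquet.crypto.aws.InMemoryKMS");
        conf.set("parquet.encryption.key.list",
                 "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==");
        // Writer-only settings: which key protects the footer, and which columns
        // are encrypted with which key (the column names here are hypothetical).
        conf.set("parquet.encryption.footer.key", "keyA");
        conf.set("parquet.encryption.column.keys", "keyB:Email,Website");
        return conf;
    }
}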
  2. Start Presto with the following hive.properties and core-site.xml config files (a note on the key-list format follows the XML):
  • hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hdfshost:9083
hive.config.resources=/Users/lap15954-local/Data/dp-presto/presto-main/etc/hadoop/core-site.xml,/Users/lap15954-local/Data/dp-presto/presto-main/etc/hadoop/hdfs-site.xml
hive.hdfs.impersonation.enabled=false
hive.hdfs.authentication.type=NONE
hive.parquet.use-column-names=true
  • core-site.xml
    <property>
        <name>parquet.encryption.kms.client.class</name>
        <value>org.apache.parquet.crypto.aws.InMemoryKMS</value>
    </property>
    <property>
        <name>parquet.encryption.key.list</name>
        <value>keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==</value>
    </property>
    <property>
        <name>parquet.crypto.factory.class</name>
        <value>org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory</value>
    </property>
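
As a sanity check on the key material: each entry in parquet.encryption.key.list has the form keyID:base64Key, and both keys above decode to 16 bytes, i.e. valid AES-128 master keys. A small Java check:

import java.util.Base64;

public class KeyListCheck {
    public static void main(String[] args) {
        // The two master keys from parquet.encryption.key.list above.
        byte[] keyA = Base64.getDecoder().decode("AAECAwQFBgcICQoLDA0ODw==");
        byte[] keyB = Base64.getDecoder().decode("AAECAAECAAECAAECAAECAA==");
        System.out.println("keyA: " + keyA.length + " bytes"); // 16 -> AES-128
        System.out.println("keyB: " + keyB.length + " bytes"); // 16 -> AES-128
    }
}

So key length is not the problem here, which is consistent with the small file decrypting correctly under the same configuration.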
  3. Query the data using presto-cli:
./target/presto-cli-0.283-executable.jar

use hive.test_schema;
select * from test_schema.customers;
select * from test_schema.customers_light;
  • Querying a column that is not encrypted works fine: (screenshot)
  • Querying an encrypted column fails: (screenshot)
  • But the small-file table (customers_light) is completely fine: (screenshots)

Context

dieu-nguyen added the bug label May 9, 2024
dieu-nguyen (Author) commented:

I tried testing with different file sizes, and the threshold seems to be 128MB (my configured HDFS block size): files <= 128MB read fine, files > 128MB fail.
When a file is larger than 128MB it is stored in multiple HDFS blocks. Could that make the dictionary page unavailable when reading the second data block?
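
That hypothesis is consistent with how PME authenticates data. Per the Parquet encryption specification, every module (data page, dictionary page, footer, ...) is encrypted with AES-GCM using an AAD built from a file-unique value plus the module type and the row-group, column, and page ordinals. If a reader that starts a split at the second HDFS block computes a different row-group ordinal than the writer used (for example, re-based to 0 for its split), the AAD differs and the GCM tag check fails even though the key is correct. A sketch of the AAD layout from the spec (not parquet-mr's actual code):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class ModuleAadSketch {
    // Module AAD suffix per the Parquet encryption spec:
    // fileAad || moduleType (1 byte) || rowGroupOrdinal (2 bytes LE)
    //         || columnOrdinal (2 bytes LE) || pageOrdinal (2 bytes LE)
    static byte[] moduleAad(byte[] fileAad, byte moduleType,
                            short rowGroup, short column, short page) {
        ByteBuffer buf = ByteBuffer.allocate(fileAad.length + 7)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.put(fileAad);
        buf.put(moduleType);
        buf.putShort(rowGroup);
        buf.putShort(column);
        buf.putShort(page);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] fileAad = new byte[8]; // random and file-unique in real files; zeroed for the sketch
        byte dataPage = 2;            // the spec's module-type code for data pages

        // Writer: first page of column 0 in row group 1 (global ordinal within the file).
        byte[] writerAad = moduleAad(fileAad, dataPage, (short) 1, (short) 0, (short) 0);
        // Reader that re-bases the row-group ordinal to 0 for its split:
        byte[] readerAad = moduleAad(fileAad, dataPage, (short) 0, (short) 0, (short) 0);

        // Different AADs -> AES-GCM authentication fails ("GCM tag check failed").
        System.out.println(Arrays.equals(writerAad, readerAad)); // false
    }
}

Whether this particular ordinal mismatch is the actual root cause would still need to be confirmed against Presto's Parquet reader.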

dieu-nguyen (Author) commented:

@shangxinli, please take a look.

tdcmeehan self-assigned this May 16, 2024
Status: 📋 Prioritized Backlog