Reading large parquet file with parquet modular encryption fails #22703

Open
dieu-nguyen opened this issue May 9, 2024 · 2 comments
dieu-nguyen commented May 9, 2024

I use Presto to read Parquet files stored in HDFS. The files have Parquet modular encryption (PME) enabled.
Reading a small file works fine, but reading a large file fails in the decrypt function.
Presto shows the error: Query 20240509_030132_00001_r659k failed: GCM tag check failed
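
For context, "GCM tag check failed" is what AES-GCM decryption reports when the key, ciphertext, or additional authenticated data (AAD) seen by the reader differs from what the writer used. Below is a minimal, self-contained Java sketch (not Presto code) that reproduces the same failure mode by decrypting with a different AAD than was used for encryption:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class GcmTagMismatchDemo {
    public static void main(String[] args) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);                 // AES-128, same key size as the keys in this issue
        SecretKey key = keyGen.generateKey();

        byte[] iv = new byte[12];         // 96-bit nonce, the standard size for GCM
        new SecureRandom().nextBytes(iv);

        // Encrypt a "page" with AAD "module-A".
        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        enc.updateAAD("module-A".getBytes(StandardCharsets.UTF_8));
        byte[] ciphertext = enc.doFinal("page data".getBytes(StandardCharsets.UTF_8));

        // Decrypt with the correct key and IV, but a different AAD "module-B".
        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        dec.updateAAD("module-B".getBytes(StandardCharsets.UTF_8));
        dec.doFinal(ciphertext);          // throws javax.crypto.AEADBadTagException: Tag mismatch!
    }
}

So a tag-check failure does not necessarily mean the bytes are corrupt; it can equally mean the reader derived a different key or AAD for a module than the writer did.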

Your Environment

  • Presto version used: 0.283
  • Storage: HDFS
  • Data source and connector used: hive-hadoop2 connector, hive metastore, parquet file with PME, using InMemoryKMS
  • Deployment: On-prem
  • Link to the complete debug logs: presto_error_log

Expected Behavior

Data should be returned to the client.

Current Behavior

The query fails in the decrypt function with a GCM tag check error.

Possible Solution

TBD

Steps to Reproduce

  1. Prepare data: put two encrypted Parquet files into HDFS (a large one for the customers table, a small one for customers_light) and create Hive external tables over them. A sketch of a matching writer-side encryption configuration follows the DDL below.
  • (screenshot of the two files in HDFS)
  • Create table queries:
CREATE EXTERNAL TABLE `test_schema`.`customers`(
  `Index` string,
  `Customer Id` string,
  `First Name` string,
  `Last Name` string,
  `Company` string,
  `City` string,
  `Country` string,
  `Phone 1` string,
  `Phone 2` string,
  `Email` string,
  `Subscription Date` string,
  `Website` string,
  `dict_col_1` string,
  `dict_col_2` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://hdfshost/tmp/test/customers';


CREATE EXTERNAL TABLE `test_schema`.`customers_light`(
  `Index` string,
  `Customer Id` string,
  `First Name` string,
  `Last Name` string,
  `Company` string,
  `City` string,
  `Country` string,
  `Phone 1` string,
  `Phone 2` string,
  `Email` string,
  `Subscription Date` string,
  `Website` string,
  `dict_col_1` string,
  `dict_col_2` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://hdfshost/tmp/test/customers_light';
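
The issue does not show how the encrypted files were produced. For reference, with parquet-mr's PropertiesDrivenCryptoFactory the writer side is typically driven by Hadoop properties like the sketch below; the footer-key and column-key assignments here are illustrative assumptions, not taken from the issue:

import org.apache.hadoop.conf.Configuration;

public class WriterSideEncryptionConfig {
    public static Configuration encryptionConf() {
        Configuration conf = new Configuration();
        // Same crypto factory and KMS settings as the reader's core-site.xml below.
        conf.set("parquet.crypto.factory.class",
                 "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");
        conf.set("parquet.encryption.kms.client.class",
                 "org.apache.parquet.crypto.aws.InMemoryKMS");
        conf.set("parquet.encryption.key.list",
                 "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==");
        // Writer-only settings: which key protects the footer, and which columns
        // are encrypted with which key (the column names here are hypothetical).
        conf.set("parquet.encryption.footer.key", "keyA");
        conf.set("parquet.encryption.column.keys", "keyB:Email,Website");
        return conf;
    }
}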
  2. Start Presto with the following hive.properties and core-site.xml config files (a note on the key-list format follows the XML):
  • hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hdfshost:9083
hive.config.resources=/Users/lap15954-local/Data/dp-presto/presto-main/etc/hadoop/core-site.xml,/Users/lap15954-local/Data/dp-presto/presto-main/etc/hadoop/hdfs-site.xml
hive.hdfs.impersonation.enabled=false
hive.hdfs.authentication.type=NONE
hive.parquet.use-column-names=true
  • core-site.xml
    <property>
        <name>parquet.encryption.kms.client.class</name>
        <value>org.apache.parquet.crypto.aws.InMemoryKMS</value>
    </property>
    <property>
        <name>parquet.encryption.key.list</name>
        <value>keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==</value>
    </property>
    <property>
        <name>parquet.crypto.factory.class</name>
        <value>org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory</value>
    </property>
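
As a sanity check on the key material: each entry in parquet.encryption.key.list has the form keyID:base64Key, and both keys above decode to 16 bytes, i.e. valid AES-128 master keys. A small Java check:

import java.util.Base64;

public class KeyListCheck {
    public static void main(String[] args) {
        // The two master keys from parquet.encryption.key.list above.
        byte[] keyA = Base64.getDecoder().decode("AAECAwQFBgcICQoLDA0ODw==");
        byte[] keyB = Base64.getDecoder().decode("AAECAAECAAECAAECAAECAA==");
        System.out.println("keyA: " + keyA.length + " bytes"); // 16 -> AES-128
        System.out.println("keyB: " + keyB.length + " bytes"); // 16 -> AES-128
    }
}

So key length is not the problem here, which is consistent with the small file decrypting correctly under the same configuration.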
  3. Query the data using presto-cli:
./target/presto-cli-0.283-executable.jar

use hive.test_schema;
select * from test_schema.customers;
select * from test_schema.customers_light;
  • Querying a column that is not encrypted works fine: (screenshot)
  • Querying an encrypted column fails: (screenshot)
  • But the small-file table (customers_light) is completely fine: (screenshots)

Context

dieu-nguyen added the bug label May 9, 2024
dieu-nguyen (Author) commented:

I tried testing with different file sizes, and the threshold seems to be 128MB (my configured HDFS block size): files <= 128MB read fine, files > 128MB fail.
When a file is larger than 128MB it is stored in multiple HDFS blocks. Could that make the dictionary page unavailable when reading the second data block?
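
That hypothesis is consistent with how PME authenticates data. Per the Parquet encryption specification, every module (data page, dictionary page, footer, ...) is encrypted with AES-GCM using an AAD built from a file-unique value plus the module type and the row-group, column, and page ordinals. If a reader that starts a split at the second HDFS block computes a different row-group ordinal than the writer used (for example, re-based to 0 for its split), the AAD differs and the GCM tag check fails even though the key is correct. A sketch of the AAD layout from the spec (not parquet-mr's actual code):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class ModuleAadSketch {
    // Module AAD suffix per the Parquet encryption spec:
    // fileAad || moduleType (1 byte) || rowGroupOrdinal (2 bytes LE)
    //         || columnOrdinal (2 bytes LE) || pageOrdinal (2 bytes LE)
    static byte[] moduleAad(byte[] fileAad, byte moduleType,
                            short rowGroup, short column, short page) {
        ByteBuffer buf = ByteBuffer.allocate(fileAad.length + 7)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.put(fileAad);
        buf.put(moduleType);
        buf.putShort(rowGroup);
        buf.putShort(column);
        buf.putShort(page);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] fileAad = new byte[8]; // random and file-unique in real files; zeroed for the sketch
        byte dataPage = 2;            // the spec's module-type code for data pages

        // Writer: first page of column 0 in row group 1 (global ordinal within the file).
        byte[] writerAad = moduleAad(fileAad, dataPage, (short) 1, (short) 0, (short) 0);
        // Reader that re-bases the row-group ordinal to 0 for its split:
        byte[] readerAad = moduleAad(fileAad, dataPage, (short) 0, (short) 0, (short) 0);

        // Different AADs -> AES-GCM authentication fails ("GCM tag check failed").
        System.out.println(Arrays.equals(writerAad, readerAad)); // false
    }
}

Whether this particular ordinal mismatch is the actual root cause would still need to be confirmed against Presto's Parquet reader.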

dieu-nguyen (Author) commented:

@shangxinli, please take a look.

tdcmeehan self-assigned this May 16, 2024
Status: 📋 Prioritized Backlog