
[SUPPORT] RLI index slowing down #11243

Open
manishgaurav84 opened this issue May 16, 2024 · 4 comments
Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced
A MongoDB table is synced to S3 by an AWS DMS CDC pipeline using a Glue job.
The job execution time increases by 50% after a few runs.
Table Stats:

  1. Number of records at Initial Run --> 530 M
  2. Avg Number of records at Incremental Runs --> 5M inserts, 20K updates, 0 deletes
  3. Hudi Jars Used:
    hudi-spark3.3-bundle_2.12-0.14.0.jar
    hudi-aws-0.14.0.jar
    httpclient-4.5.14.jar
    spark-avro_2.12-3.5.0.jar

To Reproduce

Steps to reproduce the behavior:

HUDI table configuration:

'hoodie.table.name': 'appsflyerevents',
'hoodie.datasource.write.precombine.field': 'upsert_ts',
'hoodie.datasource.write.recordkey.field': 'oid__id',
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.table': 'appsflyerevents',
'hoodie.datasource.hive_sync.database': 'origin',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.datasource.hive_sync.partition_fields': 'creation_month',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.write.partitionpath.field': 'creation_month',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
'hoodie.datasource.write.operation': 'upsert',
'hoodie.cleaner.policy': 'KEEP_LATEST_FILE_VERSIONS',
'hoodie.cleaner.fileversions.retained': 1,
'hoodie.upsert.shuffle.parallelism': 152,
'hoodie.index.type': 'RECORD_INDEX',
'hoodie.metadata.record.index.enable': 'true',
'hoodie.metadata.record.index.growth.factor': 10,
'hoodie.metadata.record.index.max.filegroup.count': 20000,
'hoodie.metadata.record.index.min.filegroup.count': 1000,
'hoodie.metadata.record.index.max.filegroup.size': 536870912,
'hoodie.metadata.enable': 'true',
'hoodie.parquet.small.file.limit': -1,
'hoodie.metadata.clean.async': 'true',
'hoodie.metadata.keep.min.commits': '4',
'hoodie.metadata.keep.max.commits': '5',
'hoodie.datasource.meta.sync.glue.metadata_file_listing': 'true'
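For reference, a minimal sketch (not from the issue; the SparkSession setup, DataFrame source, and S3 paths are assumptions/placeholders) of how a PySpark/Glue job would typically pass these options to a Hudi upsert through the Spark datasource:

from pyspark.sql import SparkSession

# Sketch, assuming a PySpark/Glue writer; paths and names below are placeholders.
spark = SparkSession.builder.appName("hudi-rli-upsert-sketch").getOrCreate()

# Placeholder: the incremental DMS/CDC batch to upsert.
df = spark.read.parquet("s3://<bucket>/<dms-cdc-prefix>/")

hudi_options = {
    'hoodie.table.name': 'appsflyerevents',
    'hoodie.datasource.write.recordkey.field': 'oid__id',
    'hoodie.datasource.write.precombine.field': 'upsert_ts',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.index.type': 'RECORD_INDEX',
    'hoodie.metadata.record.index.enable': 'true',
    'hoodie.metadata.enable': 'true',
    # ... plus the remaining options listed above
}

(df.write
   .format('hudi')
   .options(**hudi_options)
   .mode('append')
   .save('s3://<bucket>/<table-prefix>/appsflyerevents'))  # placeholder table path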

Expected behavior

The execution time should remain consistent and is not expected to increase significantly.

Environment Description

  • Hudi version : 0.14

  • Spark version : 3.3

  • Hive version : NA

  • Hadoop version : NA

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : No

Additional context

Please find the Spark UI files attached.

Stacktrace

Add the stacktrace of the error.

manishgaurav84 (Author) commented May 16, 2024

Spark UI files: Uploading DOC-20240516-WA0005.zip…

ad1happy2go (Contributor) commented:

@manishgaurav84 Not sure why, but I couldn't download the event logs. Could you ping me on Slack and share them there as well?

manishgaurav84 (Author) commented:

@ad1happy2go I have provided the logs in a Slack message.

soumilshah1995 commented:

Have you tried the async way? For example, scheduling and running the record index build with HoodieIndexer:


spark-submit \
    --class org.apache.hudi.utilities.HoodieIndexer \
    --properties-file spark-config.properties \
    --packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0,org.apache.hadoop:hadoop-aws:3.3.2' \
    --master 'local[*]' \
    --executor-memory 1g \
    /Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
    --mode scheduleAndExecute \
    --base-path 's3a://huditest/hudidb/table_name=bronze_orders' \
    --table-name bronze_orders \
    --index-types RECORD_INDEX \
    --hoodie-conf "hoodie.metadata.enable=true" \
    --hoodie-conf "hoodie.metadata.record.index.enable=true" \
    --hoodie-conf "hoodie.metadata.index.async=true" \
    --hoodie-conf "hoodie.write.concurrency.mode=optimistic_concurrency_control" \
    --hoodie-conf "hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider" \
    --parallelism 2 \
    --spark-memory 2g
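If the index is built asynchronously like this, the ingesting writer would also run with matching concurrency settings; a hedged sketch of those extra writer-side options (keys and values mirror the spark-submit above, merged into the option dict from the earlier sketch):

# Sketch (assumption): extra options for the ingest writer when the record index
# is built by a separate, async HoodieIndexer run; keys/values mirror the
# spark-submit command above.
async_index_options = {
    'hoodie.metadata.index.async': 'true',
    'hoodie.write.concurrency.mode': 'optimistic_concurrency_control',
    # Note: InProcessLockProvider coordinates writers only within a single JVM;
    # separate writer and indexer jobs typically need an external lock provider.
    'hoodie.write.lock.provider': 'org.apache.hudi.client.transaction.lock.InProcessLockProvider',
}

# hudi_options: the option dict from the earlier sketch.
hudi_options.update(async_index_options)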
