
[SUPPORT] RLI index slowing down #11243

Open
manishgaurav84 opened this issue May 16, 2024 · 4 comments
Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced
A MongoDB table is synced to S3 by an AWS DMS CDC pipeline using a Glue job.
The job execution time increases by 50% after a few runs.
Table Stats:

  1. Number of records at Initial Run --> 530 M
  2. Avg Number of records at Incremental Runs --> 5M inserts, 20K updates, 0 deletes
  3. Hudi Jars Used:
    hudi-spark3.3-bundle_2.12-0.14.0.jar
    hudi-aws-0.14.0.jar
    httpclient-4.5.14.jar
    spark-avro_2.12-3.5.0.jar

To Reproduce

Steps to reproduce the behavior:

HUDI table configuration:

'hoodie.table.name': 'appsflyerevents',
'hoodie.datasource.write.precombine.field': 'upsert_ts',
'hoodie.datasource.write.recordkey.field': 'oid__id',
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.table': 'appsflyerevents',
'hoodie.datasource.hive_sync.database': 'origin',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.datasource.hive_sync.partition_fields': 'creation_month',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.write.partitionpath.field': 'creation_month',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
'hoodie.datasource.write.operation': 'upsert',
'hoodie.cleaner.policy': 'KEEP_LATEST_FILE_VERSIONS',
'hoodie.cleaner.fileversions.retained': 1,
'hoodie.upsert.shuffle.parallelism': 152,
'hoodie.index.type': 'RECORD_INDEX',
'hoodie.metadata.record.index.enable': 'true',
'hoodie.metadata.record.index.growth.factor': 10,
'hoodie.metadata.record.index.max.filegroup.count': 20000,
'hoodie.metadata.record.index.min.filegroup.count': 1000,
'hoodie.metadata.record.index.max.filegroup.size': 536870912,
'hoodie.metadata.enable': 'true',
'hoodie.parquet.small.file.limit': -1,
'hoodie.metadata.clean.async': 'true',
'hoodie.metadata.keep.min.commits': '4',
'hoodie.metadata.keep.max.commits': '5',
'hoodie.datasource.meta.sync.glue.metadata_file_listing': 'true'
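For reference, a minimal sketch (not from the issue; the SparkSession setup, DataFrame source, and S3 paths are assumptions/placeholders) of how a PySpark/Glue job would typically pass these options to a Hudi upsert through the Spark datasource:

from pyspark.sql import SparkSession

# Sketch, assuming a PySpark/Glue writer; paths and names below are placeholders.
spark = SparkSession.builder.appName("hudi-rli-upsert-sketch").getOrCreate()

# Placeholder: the incremental DMS/CDC batch to upsert.
df = spark.read.parquet("s3://<bucket>/<dms-cdc-prefix>/")

hudi_options = {
    'hoodie.table.name': 'appsflyerevents',
    'hoodie.datasource.write.recordkey.field': 'oid__id',
    'hoodie.datasource.write.precombine.field': 'upsert_ts',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.index.type': 'RECORD_INDEX',
    'hoodie.metadata.record.index.enable': 'true',
    'hoodie.metadata.enable': 'true',
    # ... plus the remaining options listed above
}

(df.write
   .format('hudi')
   .options(**hudi_options)
   .mode('append')
   .save('s3://<bucket>/<table-prefix>/appsflyerevents'))  # placeholder table path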

Expected behavior

The execution time should remain consistent and is not expected to increase significantly.

Environment Description

  • Hudi version : 0.14

  • Spark version : 3.3

  • Hive version : NA

  • Hadoop version : NA

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : No

Additional context

Please find the Spark UI files attached.

Stacktrace

Add the stacktrace of the error.

manishgaurav84 (Author) commented May 16, 2024

Spark UI files: Uploading DOC-20240516-WA0005.zip…

ad1happy2go (Contributor) commented:

@manishgaurav84 Not sure why, but I couldn't download the event logs. Could you ping me on Slack and share them there as well?

manishgaurav84 (Author) commented:

@ad1happy2go I have provided the logs in a Slack message.

soumilshah1995 commented:

Have you tried the async way? For example, scheduling and running the record index build with HoodieIndexer:


spark-submit \
    --class org.apache.hudi.utilities.HoodieIndexer \
    --properties-file spark-config.properties \
    --packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0,org.apache.hadoop:hadoop-aws:3.3.2' \
    --master 'local[*]' \
    --executor-memory 1g \
    /Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
    --mode scheduleAndExecute \
    --base-path 's3a://huditest/hudidb/table_name=bronze_orders' \
    --table-name bronze_orders \
    --index-types RECORD_INDEX \
    --hoodie-conf "hoodie.metadata.enable=true" \
    --hoodie-conf "hoodie.metadata.record.index.enable=true" \
    --hoodie-conf "hoodie.metadata.index.async=true" \
    --hoodie-conf "hoodie.write.concurrency.mode=optimistic_concurrency_control" \
    --hoodie-conf "hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider" \
    --parallelism 2 \
    --spark-memory 2g
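If the index is built asynchronously like this, the ingesting writer would also run with matching concurrency settings; a hedged sketch of those extra writer-side options (keys and values mirror the spark-submit above, merged into the option dict from the earlier sketch):

# Sketch (assumption): extra options for the ingest writer when the record index
# is built by a separate, async HoodieIndexer run; keys/values mirror the
# spark-submit command above.
async_index_options = {
    'hoodie.metadata.index.async': 'true',
    'hoodie.write.concurrency.mode': 'optimistic_concurrency_control',
    # Note: InProcessLockProvider coordinates writers only within a single JVM;
    # separate writer and indexer jobs typically need an external lock provider.
    'hoodie.write.lock.provider': 'org.apache.hudi.client.transaction.lock.InProcessLockProvider',
}

# hudi_options: the option dict from the earlier sketch.
hudi_options.update(async_index_options)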
