Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-7146] Integrate secondary index on reader path #11162

Merged
merged 7 commits into from
Jun 7, 2024

Conversation

codope
Copy link
Member

@codope codope commented May 6, 2024

Change Logs

This PR integrates the secondary index on the reader path as follows:

  • Add HoodieTableMetadata#readSecondaryIndex API
  • Add SecondaryIndexSupport, a util class to get candidate files based on query filter and secondary index
  • Changes in HoodieFileIndex to use SecondaryIndexSupport

Impact

Can use secondary index in queries.

Risk level (write none, low medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label May 6, 2024
@codope codope force-pushed the impl-sec-index-read branch 2 times, most recently from dc04695 to 49f3c3f Compare May 7, 2024 17:36
@codope codope force-pushed the impl-sec-index-read branch 2 times, most recently from 1c2c253 to 96b10a6 Compare May 20, 2024 11:06
@codope codope force-pushed the impl-sec-index-read branch 5 times, most recently from a602c9c to 3c52961 Compare May 30, 2024 17:09
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:XL PR with lines of changes > 1000 labels May 30, 2024
@codope codope force-pushed the impl-sec-index-read branch 2 times, most recently from df12fa5 to b342d8f Compare June 3, 2024 07:34
Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, I have left some comments.

@codope codope force-pushed the impl-sec-index-read branch 3 times, most recently from 7e5f353 to 7a0a21f Compare June 5, 2024 18:05
@@ -253,7 +253,7 @@ public HoodieMetadataLogRecordReader build() {
}

private boolean shouldUseMetadataMergedLogRecordScanner() {
return PARTITION_NAME_SECONDARY_INDEX.equals(partitionName);
return partitionName.startsWith(PARTITION_NAME_SECONDARY_INDEX_PREFIX);
Copy link
Member Author

@codope codope Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this is main fix in the latest commit. The issue was not caught in the testing previously because I was only asserting count of records. I have improved the test. Now, we assert the secondary index records as well as check that file pruning happens. Please see TestSecondaryIndexPruning

}


private def verifyFilePruning(opts: Map[String, String], dataFilter: Expression): Unit = {
Copy link
Member Author

@codope codope Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: the method here and below just help with verifying file pruning happens correctly i.e. with data skipping enabled, filtered file count < data files in latest snapshot, and with data skipping disabled they should be equal.

@codope codope force-pushed the impl-sec-index-read branch 3 times, most recently from bb3bff3 to fe352bc Compare June 6, 2024 16:50
@@ -85,6 +85,7 @@ public void create(HoodieTableMetaClient metaClient, String indexName, String in
if (indexExists(metaClient, indexName)) {
throw new HoodieFunctionalIndexException("Index already exists: " + indexName);
}
checkArgument(columns.size() == 1, "Only one column can be indexed for functional or secondary index.");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: HUDI-7836 to track the support

Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, except a minor comment.

): Option[Set[String]] = {
val secondaryKeyConfigOpt = getSecondaryKeyConfig(queryReferencedColumns, metaClient)
if (secondaryKeyConfigOpt.isEmpty) {
Option.empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return Option.empty

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our scala checkstyle does not allow to add return -

<check level="error" class="org.scalastyle.scalariform.ReturnChecker" enabled="true"/>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, got it. But this judgment logic has no effect here when secondaryKey is empty, will continue to execute the latter code

}
lazy val (_, secondaryKeys) = if (isIndexAvailable) filterQueriesWithSecondaryKey(queryFilters, secondaryKeyConfigOpt.map(_._2)) else (List.empty, List.empty)
if (isIndexAvailable && queryFilters.nonEmpty && secondaryKeys.nonEmpty) {
val allFiles = fileIndex.inputFiles.map(strPath => new StoragePath(strPath)).toSeq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use prunedPartitionsAndFileSlices replace allFiles?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, let me do it in a followup PR

var fileIndex = HoodieFileIndex(spark, metaClient, None, commonOpts, includeLogFiles = true)
val filteredPartitionDirectories = fileIndex.listFiles(Seq(), Seq(dataFilter))
val filteredFilesCount = filteredPartitionDirectories.flatMap(s => s.files).size
assertTrue(filteredFilesCount < getLatestDataFilesCount(opts))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a new check assertTrue(filteredFilesCount > 0), in my test, it will be wrong

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will check this. I also need to refactor and put these in some test util class as it is used in other index test as well.

@hudi-bot
Copy link

hudi-bot commented Jun 7, 2024

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope codope merged commit e0cf4b6 into apache:master Jun 7, 2024
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
index release-1.0.0-beta2 size:L PR with lines of changes in (300, 1000]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants